Function crashes when using _mm_load_pd - c++

I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
double A2[2], B2[2], C[2];
const double * A2ptr, * B2ptr;
A2ptr = &A2[0];
B2ptr = &B2[0];
a = _mm_load_pd(A);
for(int i = 0; i < size; i+=2)
{
std::cout << "In SSE_vectormult: i is: " << i << '\n';
A2[0] = A[i];
B2[0] = B[i];
A2[1] = A[i+1];
B2[1] = B[i+1];
std::cout << "Values from A and B written to A2 and B2\n";
a = _mm_load_pd(A2ptr);
b = _mm_load_pd(B2ptr);
std::cout << "Values converted to a and b\n";
c = _mm_mul_pd(a,b);
_mm_store_pd(C, c);
A[i] = C[0];
A[i+1] = C[1];
};
// const int mask = 0xf1;
// __m128d res = _mm_dp_pd(a,b,mask);
// r1 = _mm_mul_pd(a, b);
// r2 = _mm_hadd_pd(r1, r1);
// c = _mm_hadd_pd(r2, r2);
// c = _mm_scale_pd(a, b);
// _mm_store_pd(A, c);
}
When I call it on Linux, everything is fine, but when I call it on Windows, my program crashes with "program is not working anymore". What am I doing wrong, and how can I determine my error?

Your data is not guaranteed to be 16-byte aligned, as required by aligned SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
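For example, a minimal version of such a macro (the name ALIGN is illustrative, not a standard identifier) might look like:

```cpp
#include <cstdint>

// Portable 16-byte alignment macro, switching on the compiler.
// ALIGN is an illustrative name; adapt it to your project's conventions.
#if defined(_MSC_VER)
  #define ALIGN(x) __declspec(align(x))
#else
  #define ALIGN(x) __attribute__((aligned(x)))
#endif

ALIGN(16) double A2[2], B2[2], C[2];  // now safe for _mm_load_pd / _mm_store_pd
```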

Let me try to answer why your code works on Linux and not Windows. Code compiled in 64-bit mode has the stack aligned to 16 bytes. However, code compiled in 32-bit mode is only 4-byte aligned on Windows and is not guaranteed to be 16-byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems, but MSVC defaults to 32-bit mode even on 64-bit systems. So my guess is that you did not compile your code in 64-bit mode on Windows; _mm_load_pd and _mm_store_pd both need 16-byte-aligned addresses, so the code crashes.
You have at least three different solutions to get your code working on Windows as well:
1. Compile your code in 64-bit mode.
2. Use unaligned loads and stores (e.g. _mm_storeu_pd).
3. Align the data yourself, as Paul R suggested.
The best solution is the third, since your code will then also work on 32-bit systems and on older systems where unaligned loads/stores are much slower.
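For reference, here is a sketch of the function rewritten with unaligned loads and stores applied directly to A and B, which also removes the A2/B2 staging copies (assuming size is even and the arrays hold doubles):

```cpp
#include <emmintrin.h>

// Sketch: unaligned loads/stores have no 16-byte alignment requirement,
// so A and B can be used directly and the staging copies disappear.
void SSE_vectormult(double *A, const double *B, int size)
{
    for (int i = 0; i < size; i += 2) {
        __m128d a = _mm_loadu_pd(&A[i]);
        __m128d b = _mm_loadu_pd(&B[i]);
        _mm_storeu_pd(&A[i], _mm_mul_pd(a, b));  // A[i] *= B[i], two lanes at a time
    }
}
```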

If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function _mm_load_pd is declared as:
__m128d _mm_load_pd (double *p);
So, in your code A should be of type double, but A is of type T, which is a template parameter. You should make sure that you are calling your SSE_vectormult function with the right template parameters, or just remove the template and use the double type instead.
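If you want to keep the template, one way to reject any instantiation other than double at compile time is a C++11 static_assert; a sketch (the scalar body here is a stand-in, not the original SSE code):

```cpp
#include <type_traits>

// Sketch: the static_assert turns a wrong instantiation (e.g. float or int)
// into a clear compile-time error instead of a runtime crash.
template <typename T>
void SSE_vectormult(T *A, T *B, int size)
{
    static_assert(std::is_same<T, double>::value,
                  "SSE_vectormult uses _mm_load_pd and requires double");
    for (int i = 0; i < size; ++i)  // scalar stand-in for the SSE body
        A[i] *= B[i];
}
```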

Related

_mm256_rem_epu64 intrinsic not found with GCC 10.3.0

I'm trying to rewrite the following uint64_t 2x2 matrix multiplication with AVX-512 instructions, but GCC 10.3 cannot find the _mm256_rem_epu64 intrinsic.
#include <cstdint>
#include <immintrin.h>
constexpr uint32_t LAST_9_DIGITS_DIVIDER = 1000000000;
void multiply(uint64_t f[2][2], uint64_t m[2][2])
{
uint64_t x = (f[0][0] * m[0][0] + f[0][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t y = (f[0][0] * m[0][1] + f[0][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
uint64_t z = (f[1][0] * m[0][0] + f[1][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t w = (f[1][0] * m[0][1] + f[1][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
f[0][0] = x;
f[0][1] = y;
f[1][0] = z;
f[1][1] = w;
}
void multiply_simd(uint64_t f[2][2], uint64_t m[2][2])
{
__m256i v1 = _mm256_set_epi64x(f[0][0], f[0][0], f[1][0], f[1][0]);
__m256i v2 = _mm256_set_epi64x(m[0][0], m[0][1], m[0][0], m[0][1]);
__m256i v3 = _mm256_mullo_epi64(v1, v2);
__m256i v4 = _mm256_set_epi64x(f[0][1], f[0][1], f[1][1], f[1][1]);
__m256i v5 = _mm256_set_epi64x(m[1][0], m[1][1], m[1][0], m[1][1]);
__m256i v6 = _mm256_mullo_epi64(v4, v5);
__m256i v7 = _mm256_add_epi64(v3, v6);
__m256i div = _mm256_set1_epi64x(LAST_9_DIGITS_DIVIDER);
__m256i v8 = _mm256_rem_epu64(v7, div);
_mm256_store_epi64(f, v8);
}
Is it possible somehow to enable _mm256_rem_epu64, or if not, is there some other way to calculate the remainder with SIMD instructions?
As Peter Cordes mentioned in the comments, _mm256_rem_epu64 is an SVML function. Most compilers don't support SVML; AFAIK really only ICC does, but clang can be configured to use it too.
The only other implementation of SVML I'm aware of is in one of my projects, SIMDe. In this case, since you're using GCC 10.3, the implementation of _mm256_rem_epu64 will use vector extensions, so the code from SIMDe is going to be basically the same as something like:
#include <immintrin.h>
#include <stdint.h>
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i
foo_mm256_rem_epu64(__m256i a, __m256i b) {
return (__m256i) (((u64x4) a) % ((u64x4) b));
}
In this case, both GCC and clang will scalarize the operation (see Compiler Explorer), so performance is going to be pretty bad, especially considering how slow the div instruction is.
That said, since you're using a compile-time constant, the compiler should be able to replace the division with a multiplication and a shift, so performance will be better, but we can squeeze out some more by using libdivide.
Libdivide usually computes the magic value at runtime, but the libdivide_u64_t structure is very simple, so we can skip the libdivide_u64_gen step and provide the struct at compile time:
__m256i div_by_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return libdivide_u64_do_vec256(a, &d);
}
Now, if you can use AVX-512VL + AVX-512DQ, there is a 64-bit multiplication function (_mm256_mullo_epi64); if it's available to you, it's probably the right way to go:
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return
_mm256_sub_epi64(
a,
_mm256_mullo_epi64(
libdivide_u64_do_vec256(a, &d),
_mm256_set1_epi64x(1000000000)
)
);
}
(or on Compiler Explorer, with LLVM-MCA)
If you don't have AVX-512DQ+VL, you'll probably want to fall back on vector extensions again:
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
u64x4 one_billion = { 1000000000, 1000000000, 1000000000, 1000000000 };
return (__m256i) (
    ((u64x4) a) -
    (((u64x4) libdivide_u64_do_vec256(a, &d)) * one_billion)
);
}
(on Compiler Explorer)
All this is untested, but assuming I haven't made any stupid mistakes it should be relatively snappy.
If you really want to get rid of the libdivide dependency you could perform those operations yourself, but I don't really see any good reason not to use libdivide so I'll leave that as an exercise for someone else.

Gcc misoptimises sse function

I'm converting a project from clang to gcc and I've run into an issue with a function that uses SSE intrinsics:
void dodgy_function(
const short* lows,
const short* highs,
short* mins,
short* maxs,
int its
)
{
__m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
__m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
for (int i = 0; i < its; ++i) {
reinterpret_cast<short*>(v00)[i] = lows[i];
reinterpret_cast<short*>(v10)[i] = highs[i];
}
reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];
__m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i min[2];
__m128i max[2];
min[0] = _mm_min_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));
min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1]));
max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));
reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);
reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}
Now with clang it gives me the expected output, but with gcc it prints all zeros: godbolt link
Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?
As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2
Any ideas?
The problem is that you're using short* to access the elements of a __m128i* object. That violates the strict-aliasing rule. It's only safe to go the other way, using __m128i* dereference or more normally _mm_load_si128( (const __m128i*)ptr ).
__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
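To illustrate the safe direction, a small sketch: load from and store to the short buffers through __m128i* casts, rather than writing shorts into a __m128i object:

```cpp
#include <emmintrin.h>

// Safe: __m128i* (like char*) may alias anything, so loading *from* a
// short buffer through _mm_loadu_si128 is fine. The reverse direction,
// writing shorts into a __m128i object via a short*, is what violates
// strict aliasing.
__m128i load_8_shorts(const short *p)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
}

void store_8_shorts(short *p, __m128i v)
{
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), v);
}
```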
The only standard blessed way to do type punning is with memcpy:
memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
https://godbolt.org/z/f63q7x
I prefer just using aligned memory of the correct type directly:
alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());
for (int i = 0; i < its; ++i) {
v00[i] = lows[i];
v10[i] = highs[i];
}
v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne
I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does since it's more or less the kosher way to go from straight line to vectorized code in x86.
As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre c++11 contexts. Even in that case though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory) because issues like this tend to pop up if you don't.
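As a sketch (names are illustrative), the union workaround might look like this; note that reading the member not last written is technically type punning in standard C++, though gcc documents it as supported:

```cpp
#include <emmintrin.h>

// The union forces 16-byte alignment and gives both a short view and a
// vector view of the same 32-byte buffer. gcc supports reading the
// member not last written; strictly portable C++ would use memcpy.
union Vec32 {
    short   s[16];
    __m128i v[2];
};

// Example operation: element-wise min of the two halves of the buffer.
__m128i min_of_halves(Vec32 &buf)
{
    return _mm_min_epi16(_mm_load_si128(&buf.v[0]),   // aligned loads are
                         _mm_load_si128(&buf.v[1]));  // safe via the union
}
```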


Complex numbers passed by-value from C++ to C does not seem to work on powerpc

When I'm passing a complex float (complex.h) from a C++ caller to a C library, the value does not pass correctly when running on a 32-bit PowerPC. I was using two different open-source software libraries when I detected this problem. I've isolated it down to the boundary where C++ passes a complex value type to a pure C function. I wrote up some simple code to demonstrate it.
#ifndef MMYLIB_3A8726C1_H
#define MMYLIB_3A8726C1_H
typedef struct aComplexStructure {
float r;
float i;
} myComplex_t;
#ifdef __cplusplus
#include <complex>
extern "C" {
void procWithComplex(float a, std::complex<float> *pb, std::complex<float> c, float d);
void procWithStruct(float a, myComplex_t *pb, myComplex_t c, float d);
}
#else /* __cplusplus */
#include <complex.h>
void procWithComplex(float a, float complex *pb, float complex c, float d);
void procWithStruct(float a, myComplex_t *pb, myComplex_t c, float d);
#endif
#endif /* MYLIB_3A8726C1_H */
The source C file is as follows
#include <stdio.h>
#include "myLib.h"
void procWithComplex(float a, complex float * pb, complex float c, float d)
{
printf("a=%f\n", a);
printf("b=%f + %fi\n", creal(*pb), cimag(*pb));
printf("c=%f + %fi\n", creal(c), cimag(c));
printf("d=%f\n", d);
}
void procWithStruct(float a, myComplex_t* pb, myComplex_t c, float d)
{
printf("a=%f\n", a);
printf("b=%f + %fi\n", pb->r, pb->i);
printf("c=%f + %fi\n", c.r, c.i);
printf("d=%f\n", d);
}
The calling C++ program is as follows
#include <iostream>
#include "myLib.h"
int main()
{
float a = 1.2;
std::complex<float> b = 3.4 + 3.4I;
std::complex<float> c = 5.6 + 5.6I;
float d = 9.876;
myComplex_t b_s, c_s;
b_s.r = b.real();
b_s.i = b.imag();
c_s.r = c.real();
c_s.i = c.imag();
std::cout << "a=" << a << std::endl;
std::cout << "b=" << b << std::endl;
std::cout << "c=" << c << std::endl;
std::cout << "d=" << d << std::endl << std::endl;
// c is a 64 bit structure being passed by value.
// on my 32 bit embedded powerpc platform, it is being
// passed by reference, but the underlying C library is
// reading it by value.
procWithComplex(a, &b, c, d);
std::cout << std::endl;
// This is only here to demonstrate that a 64 bit value field
// does pass through the C++ to C boundary
procWithStruct(a, &b_s, c_s, d);
return 0;
}
Normally I would expect the output to be
a=1.2
b=(3.4,3.4)
c=(5.6,5.6)
d=9.876
a=1.200000
b=3.400000 + 3.400000i
c=5.600000 + 5.600000i
d=9.876000
a=1.200000
b=3.400000 + 3.400000i
c=5.600000 + 5.600000i
d=9.876000
But when I run the source on an embedded power pc machine I get output that shows that the value type for complex is not being passed properly.
a=1.2
b=(3.4,3.4)
c=(5.6,5.6)
d=9.876
a=1.200000
b=3.400000 + 3.400000i
c=-0.000000 + 9.876000i
d=0.000000
a=1.200000
b=3.400000 + 3.400000i
c=5.600000 + 5.600000i
d=9.876000
I checked the sizes of the parameters from gdb; in both the calling frame and the function frame the sizes are 4, 4, 8 and 4 bytes for the float, complex float pointer, complex float, and float.
I realize I can just pass the complex value parameter as a pointer, or as my own struct, when crossing the C++-to-C boundary, but I want to know why I can't pass a complex value type from C++ to C on a PowerPC.
I created another example only this time I dumped some of the assembly as well as the register values.
int x = 22;
std::complex<float> y = 55 + 88I;
int z = 77;
void simpleProc(int x, complex float y, int z)
Right before the call, where the parameters are passed in:
x = 22
y = {_M_value = 55 + 88 * I}
Looking at raw data *(int*)&y = 1113325568
z = 77
This should be the assembly code where it saves the return address and stores the parameters to pass into the routine:
0x10000b78 <main()+824> lwz r9,40(r31)
0x10000b7c <main()+828> stw r9,72(r31)
0x10000b80 <main()+832> lwz r9,44(r31)
0x10000b84 <main()+836> stw r9,76(r31)
0x10000b88 <main()+840> addi r9,r31,72
0x10000b8c <main()+844> lwz r3,16(r31)
0x10000b90 <main()+848> mr r4,r9
0x10000b94 <main()+852> lwz r5,20(r31)
0x10000b98 <main()+856> bl 0x10000f88 <simpleProc>
Looking at the assembly right after the branch :)
0x10000f88 <simpleProc> stwu r1,-48(r1)
0x10000f8c <simpleProc+4> mflr r0
0x10000f90 <simpleProc+8> stw r0,52(r1)
0x10000f94 <simpleProc+12> stw r29,36(r1)
0x10000f98 <simpleProc+16> stw r30,40(r1)
0x10000f9c <simpleProc+20> stw r31,44(r1)
0x10000fa0 <simpleProc+24> mr r31,r1
0x10000fa4 <simpleProc+28> stw r3,8(r31)
0x10000fa8 <simpleProc+32> stw r5,12(r31)
0x10000fac <simpleProc+36> stw r6,16(r31)
0x10000fb0 <simpleProc+40> stw r7,20(r31)
0x10000fb4 <simpleProc+44> lis r9,4096
These are the values once we are inside the routine (after the variable values are assigned):
x = 22
y = 1.07899982e-43 + 0 * I
z = 265134296
$r3 = 22
$r4 = 0x9ffff938
*(int*)$r4 = 1113325568
$r5 = 77
*(int*)(&y) = 77
My layman's view is that the C++ side is passing the complex value type as a reference or pointer, while the C side is treating it as a value type. So is this a problem with gcc on the PowerPC? I am using gcc 4.7.1. I am in the process of building gcc 4.9.3 as a cross compiler on another machine, and I will update this post either way once I get output from the newer compiler.
I'm having problems getting the cross compiler working, but looking at the memory dump of the original problem, it does show that on the PowerPC platform the complex value is not being passed by value. I put the struct example here to show that a 64-bit value can be passed by value on a 32-bit machine.
Your code causes undefined behaviour. In the C++ unit the function is declared as:
extern "C" void procWithComplex(float a, std::complex<float> *pb, std::complex<float> c, float d);
but the function body is:
void procWithComplex(float a, complex float * pb, complex float c, float d)
which does not match.
To help the compiler diagnose this error you should avoid using the preprocessor to switch in different prototypes for the same function.
To avoid this error you need to have the function prototype only use types which are valid in both C and C++. Such as you did in the myComplex_t example.
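As a sketch of that idea, a couple of compile-time and runtime sanity checks that std::complex<float> really does match the plain C struct (since C++11 the layout of std::complex<float> is specified as two consecutive floats):

```cpp
#include <complex>
#include <cstring>

// The plain C struct from the question's header.
typedef struct aComplexStructure {
    float r;
    float i;
} myComplex_t;

// Compile-time check: the two types must at least agree in size for the
// struct-based prototype to be a safe common ground for C and C++.
static_assert(sizeof(std::complex<float>) == sizeof(myComplex_t),
              "std::complex<float> and myComplex_t differ in size");

// Runtime conversion via memcpy, the well-defined way to move the bits.
myComplex_t to_struct(std::complex<float> z)
{
    myComplex_t m;
    std::memcpy(&m, &z, sizeof m);
    return m;
}
```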
We ended up using a cross compiler instead of the native compiler on the dev board to create the binaries. Apparently complex numbers are not handled properly across the C-to-C++ boundary by the native compiler we used.
All the changes suggested were tried and failed, but they were all still good suggestions. They helped confirm our suspicion that it might be a compiler problem, which led us to try a cross compiler. Thanks all!

What is the equivalent of v4sf and __attribute__ in Visual Studio C++?

typedef float v4sf __attribute__ ((mode(V4SF)));
This is in GCC. Does anyone know the equivalent syntax?
VS 2010 reports that __attribute__ has no storage class of this type, and that mode is not defined.
I searched on the Internet and it said:
Equivalent to __attribute__( aligned( size ) ) in GCC. It is helpful for former unix developers or people writing code that works on multiple platforms that in GCC you achieve the same results using __attribute__( aligned( ... ) ).
See here for more information:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Type-Attributes.html#Type-Attributes
The full GCC code is here: http://pastebin.com/bKkTTmH1
If you're looking for the alignment directive in VC++ it's __declspec(align(16)). (or whatever you want the alignment to be)
And example usage is this:
__declspec(align(16)) float x[] = {1.,2.,3.,4.};
http://msdn.microsoft.com/en-us/library/83ythb65.aspx
Note that both attribute (in GCC) and __declspec (in VC++) are compiler-specific extensions.
EDIT:
Now that I take a second look at the code, it's going to take more work than just replacing the __attribute__ line with the VC++ equivalent to get it to compile in VC++.
VC++ doesn't have any of these macros/functions that you are using:
__builtin_ia32_xorps
__builtin_ia32_loadups
__builtin_ia32_mulps
__builtin_ia32_addps
__builtin_ia32_storeups
You're better off just replacing all of those with SSE intrinsics - which will work on both GCC and VC++.
Here's the code converted to intrinsics:
float *mv_mult(float mat[SIZE][SIZE], float vec[SIZE]) {
static float ret[SIZE];
float temp[4];
int i, j;
__m128 m, v, r;
for (i = 0; i < SIZE; i++) {
r = _mm_setzero_ps(); // zero the accumulator (avoids reading uninitialized r)
for (j = 0; j < SIZE; j += 4) {
m = _mm_loadu_ps(&mat[i][j]);
v = _mm_loadu_ps(&vec[j]);
v = _mm_mul_ps(m, v);
r = _mm_add_ps(r, v);
}
_mm_storeu_ps(temp, r);
ret[i] = temp[0] + temp[1] + temp[2] + temp[3];
}
return ret;
}
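As an aside, the store-to-temp horizontal sum at the end of each row can also be done entirely in registers; a hedged sketch using only SSE2 shuffles (no SSE3 _mm_hadd_ps required):

```cpp
#include <emmintrin.h>

// In-register horizontal sum of the four floats in r, SSE2 only.
float hsum_ps(__m128 r)
{
    __m128 shuf = _mm_shuffle_ps(r, r, _MM_SHUFFLE(2, 3, 0, 1)); // swap pairs
    __m128 sums = _mm_add_ps(r, shuf);     // (a+b, a+b, c+d, c+d)
    shuf = _mm_movehl_ps(shuf, sums);      // lane 0 now holds c+d
    sums = _mm_add_ss(sums, shuf);         // lane 0 = a+b+c+d
    return _mm_cvtss_f32(sums);
}
```

With this, `ret[i] = hsum_ps(r);` replaces the `_mm_storeu_ps(temp, r)` plus scalar adds.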
V4SF and friends have to do with GCC "vector extensions":
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Vector-Extensions.html#Vector%20Extensions
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/X86-Built-in-Functions.html
I'm not sure how much of this stuff - if any - is supported in MSVS/MSVC. Here are a few links:
http://www.codeproject.com/KB/recipes/sseintro.aspx?msg=643444
http://msdn.microsoft.com/en-us/library/y0dh78ez%28v=vs.80%29.aspx
http://msdn.microsoft.com/en-us/library/01fth20w.aspx