SSE C++ memory commands

SSE assembly has the SQRTPS instruction, which comes in two forms:
SQRTPS xmm1, xmm2
SQRTPS xmm1, m128
gcc, clang and Visual Studio all provide the intrinsic _mm_sqrt_ps, but _mm_sqrt_ps only works on an __m128 that has already been loaded (with _mm_set_ps / _mm_load_ps).
From Visual Studio, for example:
http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx
What I expect:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
asm{
sqrtps xmm0, data // DIRECTLY FROM MEMORY
movaps result, xmm0
}
What I have (in C):
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_load_ps(data); // or _mm_set_ps
xmm = _mm_sqrt_ps(xmm);
_mm_store_ps(&result[0], xmm);
(in asm):
movaps xmm1, data
sqrtps xmm0, xmm1 // FROM REGISTER
movaps result, xmm0
In other words, I would like to see something like this:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_sqrt_ps(data); // DIRECTLY FROM MEMORY, no need to load (because there is such instruction)
_mm_store_ps(&result[0], xmm);

Quick research: I made the following file, called mysqrt.cpp:
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a) {
return _mm_sqrt_ps(a[1]);
}
Trying gcc, namely g++4.8 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s:
_MySqrt:
LFB526:
sqrtps 16(%rdi), %xmm0
ret
Clang (clang++3.6 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s):
_MySqrt: ## #MySqrt
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
sqrtps 16(%rdi), %xmm0
popq %rbp
retq
Don't know about VS, but at least both gcc and clang seem to produce the memory-operand version of sqrtps when appropriate.
UPDATE Example of function usage:
#include <iostream>
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a);
int main() {
    __m128 x[2];
    x[1] = _mm_set_ps1(4);
    __m128 y = MySqrt(x);
    std::cout << y[0] << std::endl;
}
// output:
2
UPDATE 2: Regarding your code, you should just do:
auto xmm = _mm_sqrt_ps(*reinterpret_cast<__m128*>(data));
And of course it is at your own risk: you have to guarantee that data holds a valid __m128 and is properly aligned.
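A cast-free sketch of the same thing (the helper name is mine; data and result are assumed to be 16-byte-aligned float arrays): with optimization enabled, gcc and clang typically fold the aligned load into the memory operand of sqrtps anyway, so the explicit _mm_load_ps costs nothing.
#include <xmmintrin.h>
// Hypothetical helper: both pointers assumed 16-byte aligned.
// Compilers usually emit  sqrtps xmm0, [data]  /  movaps [result], xmm0
void sqrt4(const float *data, float *result)
{
    __m128 v = _mm_sqrt_ps(_mm_load_ps(data)); // load typically folded into sqrtps
    _mm_store_ps(result, v);
}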

I think you misunderstood the interface provided by the primitive _mm_sqrt_ps(__m128). The argument here can be a variable held in memory or in a register. The extension type __m128 acts like any normal builtin type, e.g. double, and is not bound to an xmm register but can also be stored in memory.
EDIT Unless you use asm, the compiler determines if and when a variable is loaded into a register or left in memory. So, in the following code snippet
__m128 foo(const __m128 x, const __m128* y, std::size_t n)
{
    __m128 result = _mm_set1_ps(1.0f);  // note: _mm_set_ps would require four arguments
    while (n--)
        result = _mm_mul_ps(result, _mm_add_ps(x, _mm_sqrt_ps(*y++)));
    return result;
}
it's up to the compiler which variables are kept in registers. I would expect the compiler to put x and result into xmm registers, but to fetch *y directly from memory.

The answer to your question is that you can't control this with intrinsics, at least for aligned loads. It's up to the compiler to decide if it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain then you have to write it in assembly. This is one of the deficiencies of intrinsics (at least as they are currently implemented) in my opinion.
Some code can help explain this.
We can get GCC (64-bit with -O3) to generate both versions, using unaligned and aligned loads:
float x[4], y[4];
__m128 x4 = _mm_loadu_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y, y4);
This gives (with Intel syntax)
movups xmm0, XMMWORD PTR [rdx]
sqrtps xmm0, xmm0
However, if we do an aligned load we get the other form
float x[4], y[4];
__m128 x4 = _mm_load_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y, y4);
This combines the load and square root into one instruction
sqrtps xmm0, XMMWORD PTR [rax]
Most people would say "trust the compiler." I disagree. If you're using intrinsics then it should be assumed that YOU know what you're doing and NOT the compiler. Here is an example (difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp) where GCC chose one form and MSVC chose the other (for a multiplication instead of the sqrt), and it made a difference in performance.
So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...
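If you really do need to force the memory-operand form, a GNU-style inline-asm wrapper is one option. This is only a sketch (the helper name and constraints are mine, and it assumes gcc/clang with the default AT&T syntax):
#include <xmmintrin.h>
// Hypothetical helper: forces SQRTPS xmm, m128 regardless of what the
// optimizer would otherwise pick. "m" pins the source to memory, "=x" asks
// for an xmm register for the result.
static inline __m128 sqrtps_from_mem(const __m128 *p)
{
    __m128 result;
    asm ("sqrtps %1, %0" : "=x"(result) : "m"(*p));
    return result;
}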

Related

how to minimize overhead loading double into simd registers working with scalar SIMD intrinsics

Using gcc 7.2 at godbolt.org, I can see that the following code is translated into assembly quite optimally: 1 load, 1 addition and 1 store.
#include <immintrin.h>
__attribute__((always_inline)) double foo(double x, double y)
{
    return x + y;
}
void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}
which results in:
usefoo(double, double*, double*):
addsd xmm0, QWORD PTR [rdi]
movsd QWORD PTR [rsi], xmm0
ret
However, if I try and achieve the same using intrinsics and template with the code below, I can see some overhead is added. In particular, what is the point of the instruction: movq xmm0, xmm0 ?
#include <immintrin.h>
__attribute__((always_inline)) double foo(double x, double y)
{
    return _mm_cvtsd_f64(_mm_add_sd(__m128d{x}, __m128d{y}));
}
void usefoo(double x, double *y, double *z)
{
    *z = foo(x, *y);
}
which results in:
usefoo(double, double*, double*):
movq xmm1, QWORD PTR [rdi]
movq xmm0, xmm0
addsd xmm0, xmm1
movlpd QWORD PTR [rsi], xmm0
ret
How can I achieve with scalar intrinsics a code equivalent to what the compiler would generate otherwise?
If you wonder why I may want to do that, think about replacing + with <=: if I write x<y, the compiler converts the result to bool, while the intrinsic keeps it as a double bitmask. Hence for my use case, writing x<y is not an option. However, using + was simple enough to illustrate the question.
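To make the comparison case concrete, here is a rough sketch (the helper name is mine) of what keeping the result as a bitmask looks like with scalar intrinsics, whereas plain x < y would collapse to bool:
#include <immintrin.h>
// Hypothetical illustration: the low element of the result is all-ones
// (a NaN bit pattern) when x < y, and all-zeros (0.0) otherwise.
static inline double less_mask(double x, double y)
{
    return _mm_cvtsd_f64(_mm_cmplt_sd(_mm_set_sd(x), _mm_set_sd(y)));
}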
The "extraneous" movq is clearing the second element in the __m128d, as you requested by the list-initialization __m128d{x}.
When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.
Remember that when fewer initializers are supplied than there are members, all remaining members are value-initialized (to zero).
I would expect a higher level of optimization to see that the second element is never used, and remove the extraneous instruction. On the other hand, even though unused, the second value cannot be allowed to trap during the addition operation, and clearing it explicitly may be the safest way to ensure it does not.
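One variant worth trying (a sketch only, not guaranteed to remove the movq on every compiler version) is to express the zero-extension through _mm_set_sd rather than list-initialization, so the intent is explicit to the optimizer:
#include <immintrin.h>
// Same semantics as the foo() above, but built from _mm_set_sd for both inputs;
// whether the compiler still materializes the zeroing movq is up to it.
static inline double foo_setsd(double x, double y)
{
    return _mm_cvtsd_f64(_mm_add_sd(_mm_set_sd(x), _mm_set_sd(y)));
}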

How to write portable simd code for complex multiplicative reduction

I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n) {
    complex float p = 1.0;
    for (int i = 0; i < n; i++)
        p *= x[i];
    return p;
}
n will be at most 50.
GCC can't auto-vectorize complex multiplication, but since I am happy to assume the gcc compiler, and if I knew I wanted to target SSE3, I could follow "How to enable sse3 autovectorization in gcc" and write:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
    v4sf v;
    float e[4];
} float4;
typedef struct {
    float4 x;
    float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
    return (complex4){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x[], int n) {
    v4sf one = {1, 1, 1, 1};
    complex4 p = {one, one};
    for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
    return p;
}
This indeed produces fast, vectorized assembly code with gcc, although you still need to pad your input to a multiple of 4. The assembly you get is:
.L3:
vmovaps xmm0, XMMWORD PTR 16[rsi]
add rsi, 32
vmulps xmm1, xmm0, xmm2
vmulps xmm0, xmm0, xmm3
vfmsubps xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
vmovaps xmm3, xmm1
vfmaddps xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
cmp rdx, rsi
jne .L3
However, it is written for that exact SIMD instruction set and is not optimal for AVX2 or AVX512, for example, for which you need to change the code.
How can you write C or C++ code for which gcc will produce optimal code when compiled for any of SSE, AVX2 or AVX512? That is, do you always have to write separate functions by hand for each different width of SIMD register?
Are there any open source libraries that make this easier?
Here would be an example using the Eigen library:
#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
    return Eigen::VectorXcf::Map(x, n).prod();
}
If you compile this with clang or g++ and sse or avx enabled (and -O2), you should get fairly decent machine code. It also works for some other architectures like Altivec or NEON. If you know that the first entry of x is aligned, you can use MapAligned instead of Map.
You get even better code if you happen to know the size of your vector at compile time, using this:
template<int n>
std::complex<float> f(const std::complex<float> *x)
{
    return Eigen::Matrix<std::complex<float>, n, 1>::MapAligned(x).prod();
}
Note: The functions above directly correspond to the function f of the OP.
However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this requires lots of shuffling for multiplication. Instead, one should store real and imaginary parts in a way that lets them be loaded directly, one packet at a time.
Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:
typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;
complex8 prod(const complex8& a, const complex8& b)
{
    return a*b;
}
Or more generic (using C++11):
template<int size, typename Scalar = float> using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;
template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
    return a*b;
}
When compiled with -mavx -O2, this compiles to something like this (using g++-5.4):
vmovaps 32(%rsi), %ymm1
movq %rdi, %rax
vmovaps (%rsi), %ymm0
vmovaps 32(%rdi), %ymm3
vmovaps (%rdi), %ymm4
vmulps %ymm0, %ymm3, %ymm2
vmulps %ymm4, %ymm1, %ymm5
vmulps %ymm4, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm5, %ymm2, %ymm2
vsubps %ymm1, %ymm0, %ymm0
vmovaps %ymm2, 32(%rdi)
vmovaps %ymm0, (%rdi)
vzeroupper
ret
For reasons not obvious to me, this is actually hidden inside a method that is called by the actual function, which itself just moves some memory around -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same code with clang 3.8.0 (and the same arguments), it compiles to just:
vmovaps (%rsi), %ymm0
vmovaps %ymm0, (%rdi)
vmovaps 32(%rsi), %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps (%rdi), %ymm1
vmovaps (%rdx), %ymm2
vmovaps 32(%rdx), %ymm3
vmulps %ymm2, %ymm1, %ymm4
vmulps %ymm3, %ymm0, %ymm5
vsubps %ymm5, %ymm4, %ymm4
vmulps %ymm3, %ymm1, %ymm1
vmulps %ymm0, %ymm2, %ymm0
vaddps %ymm1, %ymm0, %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps %ymm4, (%rdi)
movq %rdi, %rax
vzeroupper
retq
Again, the memory movement at the beginning is weird, but at least it is vectorized. For both gcc and clang this gets optimized away when called in a loop, however:
complex8 f8(complex8 x[], int n) {
    if (n == 0)
        return complex8(v8sf::Ones(), v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?
    complex8 p = x[0];
    for (int i = 1; i < n; i++) p = prod(p, x[i]);
    return p;
}
The difference here is that clang will unroll that outer loop to 2 multiplications per loop. On the other hand, gcc will use fused-multiply-add instructions when compiled with -mfma.
The f8 function can of course also be generalized to arbitrary dimensions:
template<int size>
complexX<size> fX(complexX<size> x[], int n) {
    using S = typename complexX<size>::value_type;
    if (n == 0)
        return complexX<size>(S::Ones(), S::Zero());
    complexX<size> p = x[0];
    for (int i = 1; i < n; i++) p *= x[i];
    return p;
}
And for reducing the complexX<N> to a single std::complex the following function can be used:
// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
    complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
    complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
    return redux(a*b);
}
template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
    return std::complex<float>(var.real()[0], var.imag()[0]);
}
However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ has a tendency to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...)
Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (not sure, how much impact this routine has in your overall problem).
Also, I wanted to note that Eigen actually is a linear algebra library. Exploiting it for pure portable SIMD code generation is not really what it is designed for.
If portability is your main concern, there are many libraries that provide SIMD instructions in their own syntax. Most of them make explicit vectorization simpler and more portable than intrinsics. This library (UME::SIMD) was published recently and has great performance:
In this paper (UME::SIMD) an interface based on Vc has been established which is named UME::SIMD. It allows the programmer to access the SIMD capabilities without the need for extensive knowledge of SIMD ISAs. UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to intrinsics.
I don't think you have a fully general solution for this. You can increase your "vector_size" to 32:
typedef float v4sf __attribute__ ((vector_size (32)));
Also increase all arrays to have 8 elements:
typedef float v8sf __attribute__ ((vector_size (32)));
typedef union {
    v8sf v;
    float e[8];
} float8;
typedef struct {
    float8 x;
    float8 y;
} complex8;
static complex8 complex8_mul(complex8 a, complex8 b) {
    return (complex8){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
This will make the compiler able to generate AVX512 code (don't forget to add -mavx512f), but will make your code slightly worse in SSE by making memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.
You could keep both versions (with 4 and with 8 array elements), switching between them by some flag, but it might be too tedious for little benefit.
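As a rough sketch of the "switch by a flag" idea (names are mine, and it assumes you only care about SSE vs. AVX here), the width can be chosen by the preprocessor while the multiplication body stays identical:
#if defined(__AVX__)
  #define CPLX_WIDTH 8
#else
  #define CPLX_WIDTH 4
#endif

typedef float vf __attribute__ ((vector_size (CPLX_WIDTH * sizeof(float))));
typedef struct { vf x; vf y; } complexN;

// Same complex product as above, just width-agnostic.
static complexN complexN_mul(complexN a, complexN b) {
    return (complexN){a.x*b.x - a.y*b.y, a.y*b.x + a.x*b.y};
}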

Are there Move (_mm_move_ss) and Set (_mm_set_ss) intrinsics that work for doubles (__m128d)?

Over the years, a few times I have seen intrinsic functions with float parameters that get transformed to __m128 with the following code: __m128 b = _mm_move_ss(m, _mm_set_ss(a));.
For instance:
void MyFunction(float y)
{
    __m128 a = _mm_move_ss(m, _mm_set_ss(y)); //m is __m128
    //do whatever it is with 'a'
}
I wonder if there is a similar way of using _mm_move and _mm_set intrinsics to do the same for doubles (__m128d)?
Almost every _ss and _ps intrinsic / instruction has a double version with a _sd or _pd suffix. (Scalar Double or Packed Double).
For example, search for "(double" in Intel's intrinsics finder to find intrinsic functions that take a double as the first arg. Or just figure out what optimal asm would be, then look up the intrinsics for those instructions in the insn ref manual. Except that it doesn't list all the intrinsics for movsd, so searching for an instruction name in the intrinsics finder often works.
re: header files: always just include <immintrin.h>. It includes all Intel SSE/AVX intrinsics.
See also ways to put a float into a vector, and the sse tag wiki for links about how to shuffle vectors. (i.e. the tables of shuffle instructions in Agner Fog's optimizing assembly guide)
(see below for a godbolt link to some interesting compiler output)
re: your sequence
Only use _mm_move_ss (or sd) if you actually want to merge two vectors.
You don't show how m is defined. Your use of a as the variable name for both the float and the vector implies that the only useful information in the vector is the float arg. The variable-name clash of course means it doesn't compile.
There unfortunately doesn't seem to be any way to just "cast" a float or double into a vector with garbage in the upper 3 elements, like there is for __m128 -> __m256:
__m256 _mm256_castps128_ps256 (__m128 a). I posted a new question about this limitation with intrinsics: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
I tried using _mm_undefined_ps() to achieve this, hoping this would clue in the compiler that it can just leave the incoming high garbage in place, in
// don't use this, it doesn't make better code
__m128d double_to_vec_highgarbage(double x) {
    __m128d undef = _mm_undefined_pd();
    __m128d x_zeroupper = _mm_set_sd(x);
    return _mm_move_sd(undef, x_zeroupper);
}
but clang3.8 compiles it to
# clang3.8 -O3 -march=core2
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
So no advantage, still zeroing the upper half instead of compiling it to just a ret. gcc actually makes pretty bad code:
double_to_vec_highgarbage: # gcc5.3 -march=nehalem
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm1, QWORD PTR [rsp-16] # D.26885, %sfp
pxor xmm0, xmm0 # __Y
movsd xmm0, xmm1 # tmp93, D.26885
ret
_mm_set_sd appears to be the best way to turn a scalar into a vector.
__m128d double_to_vec(double x) {
    return _mm_set_sd(x);
}
clang compiles it to a movq xmm0,xmm0, gcc to a store/reload with -march=generic.
Other interesting compiler outputs from the float and double versions on the Godbolt compiler explorer
float_to_vec: # gcc 5.3 -O3 -march=core2
movd eax, xmm0 # x, x
movd xmm0, eax # D.26867, x
ret
float_to_vec: # gcc5.3 -O3 -march=nehalem
insertps xmm0, xmm0, 0xe # D.26867, x
ret
double_to_vec: # gcc5.3 -O3 -march=nehalem. It could have used movq or insertps, instead of this longer-latency store-forwarding round trip
movsd QWORD PTR [rsp-16], xmm0 # %sfp, x
movsd xmm0, QWORD PTR [rsp-16] # D.26881, %sfp
ret
float_to_vec: # clang3.8 -O3 -march=core2 or generic (no -march)
xorps xmm1, xmm1
movss xmm1, xmm0 # xmm1 = xmm0[0],xmm1[1,2,3]
movaps xmm0, xmm1
ret
double_to_vec: # clang3.8 -O3 -march=core2, nehalem, or generic (no -march)
movq xmm0, xmm0 # xmm0 = xmm0[0],zero
ret
float_to_vec: # clang3.8 -O3 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret
So both clang and gcc use different strategies for float vs. double, even when they could use the same strategy.
Using integer operations like movq between floating-point operations causes extra bypass delay latency. Using insertps to zero the upper elements of the input register should be the best strategy for float or double, so all compilers should use that when SSE4.1 is available. xorps + blend is good, too, and can run on more ports than insertps. The store/reload is probably the worst, unless we're bottlenecked on ALU throughput, and latency doesn't matter.
_mm_move_sd, _mm_set_sd. They're SSE2 intrinsics (and not SSE), so you'll need #include <emmintrin.h>.
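So the double analogue of the snippet in the question would look roughly like the sketch below (here m is just taken as a parameter, standing in for whatever __m128d you want to merge into):
#include <emmintrin.h>
void MyFunctionD(double y, __m128d m)
{
    __m128d a = _mm_move_sd(m, _mm_set_sd(y)); // low element = y, high element kept from m
    // ... do whatever it is with 'a'
    (void)a;
}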

Nasm, C++, passing class object

I have some class in my cpp file.
class F {
private:
    int id;
    float o;
    float p;
    float s;
    static int next;
public:
    F(double o, double s = 0.23, double p = 0.0):
        id(next++), o(o),
        p(p), s(s) {}
};
int F::next = 0;
extern "C" float pod(F f);
int main() {
    F bur(1000, 0.23, 100);
    pod(bur);
    return 0;
}
and I'm trying to pass the class object bur to the function pod, which is defined in my asm file. However, I'm having big problems getting the values out of this class object.
In the asm routine I have 0.23 in XMM1 and 100 in XMM2, but I can't find where 1000 is stored.
I don't know why you are seeing 100 in xmm2; I suspect that is entirely coincidence. The easiest way to see how your struct is being passed is to compile the C++ code.
With cruft removed, my compiler does this:
main:
.LFB3:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl _ZN1F4nextE(%rip), %edi # load F::next into edi/rdi
movq .LC3(%rip), %xmm0 # load { 0.23, 100 } into xmm0
leal 1(%rdi), %eax # store rdi + 1 into eax
movl %eax, _ZN1F4nextE(%rip) # store eax back into F::next
movabsq $4934256341737799680, %rax # load { 1000.0, 0 } into rax
orq %rax, %rdi # or rax into pre-increment F::next in rdi
call pod
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.LC3:
.quad 4497835022170456064
The constant 4497835022170456064 is 3E6B851F42C80000 in hex, and if you look at the most significant four bytes (3E6B851F), this is 0.23 when interpreted as a single precision float, and the least significant four bytes (42C80000) are 100.0.
Similarly the most significant four bytes of the constant 4934256341737799680 (hex 447A000000000000) are 1000.0.
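A quick way to convince yourself of those bit patterns is to reinterpret them as single-precision floats; this is just a throwaway check, not part of the original answer:
#include <cstdint>
#include <cstring>
#include <cstdio>
int main() {
    uint32_t bits[3] = { 0x3E6B851F, 0x42C80000, 0x447A0000 };
    for (uint32_t b : bits) {
        float f;
        std::memcpy(&f, &b, sizeof f); // reinterpret the bit pattern as a float
        std::printf("%g\n", f);        // prints 0.23, 100, 1000
    }
}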
So, bur.id and bur.o are passed in rdi and bur.p and bur.s are passed in xmm0.
The reason for this is documented in the x86-64 ABI reference. In extreme summary: because the first two fields are small enough, of mixed type, and one of them is integer, they are passed in a general-purpose register (rdi being the first general-purpose parameter register); because the next two fields are both float, they are passed in an SSE register.
You want to have a look at the calling-conventions document from Agner Fog. Depending on the compiler, the operating system and whether you are compiling for 32 or 64 bits, different things may happen (see Table 5, chapter 7).
For 64-bit Linux, for instance, since your object contains members of different types (see Table 6), the R case seems to apply:
Entire object is transferred in integer registers and/or XMM registers if the size is no bigger than 128 bits, otherwise on the stack. Each 64-bit part of the object is transferred in an XMM register if it contains only float or double, or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register.
In your case, the class fits in 128 bits. The experiment from @CharlesBailey illustrates this behavior. According to the convention
... or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register. Examples: int and float: RDI.
the first integer register rdi should hold id and o, while xmm0 should hold p and s.
Seeing 100 in xmm2 might be a side effect of initialization, as that value is passed as a double to the constructor.

Why does my data not seem to be aligned?

I'm trying to figure out how to best pre-calculate some sin and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
    m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C expression. In an assembly instruction, however, (__asm?), it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment failure, and would give you incorrect results in the odd case where t's address happens to be aligned.
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <tchar.h>
#include <malloc.h>     // for _aligned_malloc
#include <xmmintrin.h>
int main()
{
    static __m128 *m_sincos;
    int Bins = 4;
    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        __asm movaps xmm0, m_sincos[t];
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }
    return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
You should always use the intrinsics, or just enable the relevant compiler options and let the compiler vectorize, rather than coding it explicitly in assembly. This is because __asm is not portable to 64-bit code.
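For reference, a purely intrinsic version of the same access pattern might look like the sketch below (the function name is mine, and result is assumed to be 16-byte aligned); the compiler is then free to keep m_sincos[t] in a register or fold the load into a later instruction.
#include <xmmintrin.h>
void use_sincos(const __m128 *m_sincos, int t, float *result)
{
    __m128 v = m_sincos[t];   // aligned 128-bit load, same effect as movaps xmm0, [m_sincos + 16*t]
    _mm_store_ps(result, v);  // same effect as movaps [result], xmm0
}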