I'm trying to figure out how to best pre-calculate some sin and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C expression. In an assembly instruction inside an __asm block, however, it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment failure, and would give you incorrect results in the odd case where t's address happens to be aligned.
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <tchar.h>
#include <xmmintrin.h>
int main()
{
static __m128 *m_sincos;
int Bins = 4;
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++) {
m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
__asm movaps xmm0, m_sincos[t];
__asm mov eax, m_sincos
__asm mov ebx, t
__asm shl ebx, 4
__asm movaps xmm0, [eax+ebx];
}
return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
You should always use the intrinsics, or even just enable the compiler's SSE code generation and leave it at that, rather than coding the assembly explicitly. One reason is that __asm is not portable to 64-bit code: MSVC does not support inline assembly when targeting x64.
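For illustration, here is a minimal sketch of the "intrinsics throughout" approach applied to the original table lookup. The names use_sincos, other and result are placeholders for whatever the real computation uses, and result is assumed to point to 16-byte-aligned storage:
#include <xmmintrin.h>

void use_sincos(const __m128 *m_sincos, int t, __m128 other, float *result)
{
    __m128 v = m_sincos[t];              // plain C++ indexing; the compiler emits an aligned load
    __m128 prod = _mm_mul_ps(v, other);  // example SSE work on the table entry
    _mm_store_ps(result, prod);          // aligned store of the four floats
}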
Related
I need to support dynamic libraries and static linking of object files for 32 bit platforms (x86): Win32, Linux32 and MacOS32. The problem occurs when passing FPU arguments (float and double). By default, they are passed in SSE registers, not the stack. I am not against SSE, but I need the arguments and the result to be passed standardly - through the stack and the FPU.
I tried (godbolt) setting the -mno-sse option, and this produces the desired result. But I would not want to completely abandon SSE, I would sometimes like to use intrinsics and/or use MMX/SSE optimizations.
__attribute__((stdcall))
long double test(int* num, float f, double d)
{
*num = sizeof(long double);
return f * d;
}
/*-target i386-windows-gnu -c -O3*/
push ebp
mov ebp, esp
and esp, -8
sub esp, 8
movss xmm0, dword ptr [ebp + 12] # xmm0 = mem[0],zero,zero,zero
mov eax, dword ptr [ebp + 8]
cvtss2sd xmm0, xmm0
mov dword ptr [eax], 12
mulsd xmm0, qword ptr [ebp + 16]
movsd qword ptr [esp], xmm0
fld qword ptr [esp]
mov esp, ebp
pop ebp
ret 16
/*-target i386-windows-gnu -mno-sse -c -O3*/
mov eax, dword ptr [esp + 4]
mov dword ptr [eax], 12
fld dword ptr [esp + 8]
fmul qword ptr [esp + 12]
ret 16
Both versions of your function are using the same calling convention
By default, they are passed in SSE registers, not the stack.
That's not what your asm output shows, and not what happens. Notice that your first function loads its dword float arg from the stack into xmm0, then uses mulsd with the qword double arg, also from the stack. movss xmm0, dword ptr [ebp + 12] is a load that destroys the old contents of XMM0; XMM0 is not an input to this function.
Then, to return the retval in x87 st0 as per the crusty old 32-bit calling convention you're using, it uses a movsd store to the stack and an fld x87 load.
The * operator promotes the float to double to match the other operand, resulting in a double multiply, not long double. Promotion from double to long double doesn't happen until that temporary double result is returned.
It looks like clang defaults to what gcc would call -mfpmath=sse if available. This is normally good, except for small functions where the x87 return-value calling convention gets in the way. (Also note that x87 has "free" promotion from float and double to long double, as part of how fld dword and qword work.) Clang isn't checking how much overhead it's going to cost to use SSE math in a small function; here it would obviously have been more efficient to use x87 for one multiply.
But anyway, -mno-sse is not changing the ABI; read your asm more carefully. If it was, the generated asm would suck less!
On Windows, if you're stuck making 32-bit code at all, vectorcall should be a better way to pass/return FP vars when possible: it can use XMM registers to pass/return. Obviously any ABIs that are set in stone (like for existing libraries) need to be declared correctly so the compiler calls them / receives return values from them correctly.
What you currently have is stdcall with FP args on the stack and returned in st0.
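For comparison, a hedged sketch of a __vectorcall declaration (MSVC / clang-cl syntax; the name test_vc is made up, the exact register assignment is the compiler's job, and the return type is double here so the result comes back in XMM0 rather than going through x87):
double __vectorcall test_vc(int *num, float f, double d)
{
    *num = sizeof(long double);
    return f * d;   // float/double args are XMM candidates; the double result is returned in XMM0
}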
BTW, a lot of the code in your first function comes from clang aligning the stack so it can spill/reload the temporary double; the Windows ABI only guarantees 4-byte stack alignment. This amount of work to avoid the risk of a cache-line split is almost certainly not worth it, especially when it could have just reused its double d stack arg as scratch space and hoped the caller had aligned it. Optimization is enabled; it's just setting up a frame pointer so it can use and esp to align the stack without losing the old ESP.
You could use return f * (long double)d;
That compiles to identical asm to the -mno-sse version. https://godbolt.org/z/LK0s_5
SSE2 doesn't support 80-bit x87 types, so clang is forced to use fmul. It ends up not messing around at all with SSE, and then the result is where it needs it for a return value.
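Putting that cast into the original function, for completeness:
__attribute__((stdcall))
long double test(int* num, float f, double d)
{
    *num = sizeof(long double);
    return f * (long double)d;   // forces an x87 fmul; same asm as the -mno-sse version
}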
Good evening.
I know C-style arrays and std::array aren't any faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: for the vector version, the numbers are added together through less efficient opcodes, while the array version uses (more) SSE instructions. The vector version also involves more memory lookups than the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
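A rough sketch of that loop-versioning transformation at the source level (this only illustrates what the compiler could emit, not what gcc actually does; a real compiler would compare the raw addresses in asm, while here uintptr_t keeps the sketch self-contained):
#include <cstddef>
#include <cstdint>

static bool disjoint(const double *a, size_t an, const double *b, size_t bn)
{
    uintptr_t a0 = (uintptr_t)a, a1 = (uintptr_t)(a + an);
    uintptr_t b0 = (uintptr_t)b, b1 = (uintptr_t)(b + bn);
    return a1 <= b0 || b1 <= a0;   // true if the two ranges do not overlap
}

void assemble_vec_versioned()
{
    double *g = glob.data();
    if (disjoint(g, N, v1.data(), 3) && disjoint(g, N, v2.data(), 3))
    {
        // no overlap: hoist the three sums out of the loop
        const double s0 = comb(v1[0], v2[0]);
        const double s1 = comb(v1[1], v2[1]);
        const double s2 = comb(v1[2], v2[2]);
        for (size_t i = 0; i < N-2; ++i)
        {
            g[i] += s0; g[i+1] += s1; g[i+2] += s2;
        }
    }
    else
    {
        assemble_vec();   // possible overlap: fall back to the reload-every-iteration loop
    }
}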
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
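A portable source-level workaround (no __restrict needed) is to do the hoisting yourself: copy the three sums into locals before the loop. Locals cannot alias glob's storage, so the compiler can keep them in registers, just as it does in the std::array version. A minimal sketch using the names from the question:
void assemble_vec_hoisted()
{
    const double s0 = comb(v1[0], v2[0]);   // computed once, kept in registers
    const double s1 = comb(v1[1], v2[1]);
    const double s2 = comb(v1[2], v2[2]);
    for (size_t i = 0; i < N-2; ++i)
    {
        glob[i]   += s0;
        glob[i+1] += s1;
        glob[i+2] += s2;
    }
}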
I think the point is that you use too small a storage size (six doubles). In the std::array case this allows the compiler to eliminate the storage in RAM entirely by placing the values in registers; the compiler can keep stack variables in registers when that is more optimal. This cuts the memory accesses in half (only the writes to glob remain). In the std::vector case the compiler cannot perform such an optimization, since dynamic memory is used. Try significantly larger sizes for a1, a2, v1, v2.
For some reason one of my functions executes an SSE movaps instruction with an unaligned operand, which causes a crash. It happens on the first line of the function; the rest needs to be there just for the crash to happen, but is omitted for clarity.
Vec3f CrashFoo(
const Vec3f &aVec3,
const float aFloat,
const Vec2f &aVec2)
{
const Vec3f vecNew =
Normalize(Vec3f(aVec3.x, aVec3.x, std::max(aVec3.x, 0.0f)));
// ...
}
This is how I call it from the debugging main:
int32_t main(int32_t argc, const char *argv[])
{
Vec3f vec3{ 0.00628005248f, -0.999814332f, 0.0182171166f };
Vec2f vec2{ 0.947231591f, 0.0522233732f };
float floatVal{ 0.010f };
Vec3f vecResult = CrashFoo(vec3, floatVal, vec2);
return (int32_t)vecResult.x;
}
This is the disassembly from the beginning of the CrashFoo function to the line where it crashes:
00007FF7A7DC34F0 mov rax,rsp
00007FF7A7DC34F3 mov qword ptr [rax+10h],rbx
00007FF7A7DC34F7 push rdi
00007FF7A7DC34F8 sub rsp,80h
00007FF7A7DC34FF movaps xmmword ptr [rax-18h],xmm6
00007FF7A7DC3503 movss xmm6,dword ptr [rdx]
00007FF7A7DC3507 movaps xmmword ptr [rax-28h],xmm7
00007FF7A7DC350B mov dword ptr [rax+18h],0
00007FF7A7DC3512 mov rdi,r9
00007FF7A7DC3515 mov rbx,rcx
00007FF7A7DC3518 movaps xmmword ptr [rax-38h],xmm8
00007FF7A7DC351D movaps xmmword ptr [rax-48h],xmm9
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
00007FF7A7DC352B xorps xmm8,xmm8
00007FF7A7DC352F comiss xmm8,xmm6
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
My understanding is that it first does the usual function-call stuff and then starts preparing the playground by saving the current contents of some SSE registers (xmm6-xmm11) onto the stack so that they are free to be used by the subsequent code. The xmm* registers are stored one after another to addresses from [rax-18h] to [rax-58h], which are nicely aligned to 16 bytes since rax=0xe4d987f788, but before the xmm11 register gets stored, rax is increased by 18h, which breaks the alignment and causes the crash. The xorps and comiss lines are where the actual code starts (std::max's comparison with 0). When I remove std::max it works nicely.
Do you see any reason for this behaviour?
Additional info
I uploaded a small compilable example that crashes for me in my Visual Studio, but not in the IDEone.
The code is compiled in Visual Studio 2013 Update 5 (x64 release, v120). I've set the "Struct Member Alignment" setting of the project to 16 bytes, but with little improvement and there are no packing pragma in the structures that I use. The error message is:
First-chance exception at 0x00007ff7a7dc3533 in PG3Render.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.
gcc and clang are both fine, and make non-crashing non-vectorized code for your example. (Of course, I'm compiling for the Linux SysV ABI where none of the vector regs are caller-saved, so they weren't generating code to save xmm{6..15} on the stack in the first place.)
Your IDEone link doesn't demonstrate a crash either, so IDK. I think there are online compile & run sites that have MSVC as an option. You can even get asm out of them if your program uses system to run a disassembler on itself. :P
The asm output you posted is guaranteed to crash, for any possible value of rax:
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
...
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
Accounting for the LEA, the second store address is [init_rax-50h], which is only 8B offset from the earlier stores. One or the other will fault. This appears to be a compiler bug that you should report.
I have no idea why your compiler would use lea instead of add rax, 18h; it does it right before clobbering the flags with comiss anyway.
When using SSE intrinsics, often zero vectors are required. One way to avoid creating a zero variable inside a function whenever the function is called (each time effectively calling some xor vector instruction) would be to use a static local variable, as in
static inline __m128i negate(__m128i a)
{
static __m128i zero = _mm_setzero_si128();
return _mm_sub_epi16(zero, a);
}
It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a true function instead of the _mm_setzero_si128() intrinsic. It only seems to be possible in C++, not in C, by the way.)
(1) However, once this initialization has happened: Does this block a xmm register for the rest of the program?
(2) Even worse: If such a static local variable is used in multiple functions, would it block multiple xmm registers?
(3) The other way round: If it is not blocking a xmm register, would the zero variable always be reloaded from memory when the function is called? Then the static local variable would be pointless since it would be faster to use _mm_setzero_si128().
As an alternative, I was thinking about putting zero into a global static variable that would be initialized at program start:
static __m128i zero = _mm_setzero_si128();
(4) Would the global variable stay in a xmm register while the program runs?
Thanks a lot for your help!
(Since this also applies to AVX intrinsics, I also added the AVX tag.)
Answering the question that should really be asked here: you should not be worrying about this at all. Zeroing a register via xor effectively costs nothing at all most of the time. Modern x86 processors recognize this idiom and handle the zeroing directly in register rename; no µop needs to execute at all. The only time this can slow you down is if you are bound by the front-end, but that is a rather rare situation to be in.
While variations on these questions might be worth pondering in other circumstances (and Mysticial's comment gives some good leads on how to answer them yourself), you should really just use setzero and call it a day.
In regards to this particular operation, you should do as Stephen Canon says and use
static inline Vec8s operator - (Vec8s const & a) {
return _mm_sub_epi16(_mm_setzero_si128(), a);
}
That's taken directly from Agner Fog's Vector Class Library.
But let's consider what the static keyword does. When you declare a variable using static it uses static storage. This places it in the data section (which includes the .bss section) of your object file.
#include <x86intrin.h>
extern "C" void foo2(__m128i a);
static const __m128i zero = _mm_setzero_si128();
static inline __m128i negate(__m128i a) {
return _mm_sub_epi16(zero, a);
}
extern "C" void foo(__m128i a, __m128i b) {
foo2(negate(a));
}
I do g++ -O3 -c static.cpp and then look at the disassembly and sections. I see there is a .bss section with a label _ZL4zero. Then there is a startup code section which initializes the static variable in the .bss section.
.text.startup
pxor xmm0, xmm0
movaps XMMWORD PTR _ZL4zero[rip], xmm0
ret
The foo function
movdqa xmm1, XMMWORD PTR _ZL4zero[rip]
psubw xmm1, xmm0
movdqa xmm0, xmm1
So GCC never uses a XMM register for the static variable. It reads from memory in the data section.
What if we did _mm_sub_epi16(_mm_setzero_si128(),a)? Then GCC produces for foo
pxor xmm1, xmm1
psubw xmm1, xmm0
movdqa xmm0, xmm1
On Intel processors since Sandy Bridge the pxor is "free". On processors before that it's almost free. So this is clearly a better solution than reading from memory.
What if we tried _mm_sub_epi16(_mm_set1_epi32(-1), a)? In that case GCC produces
pcmpeqd xmm1, xmm1
psubw xmm1, xmm0
movdqa xmm0, xmm1
The pcmpeqd instruction is not free on any processor but it's still better than reading from memory using movdqa. Okay, so 0 and -1 are special. What about _mm_sub_epi16(_mm_set1_epi32(1), a)? In this case GCC produces for foo
movdqa xmm1, XMMWORD PTR .LC0[rip]
psubw xmm1, xmm0
movdqa xmm0, xmm1
That's essentially the same as using a static variable! When I look at the sections I see that .LC0 points to a read only data section (.rodata).
Edit: here is a way to get GCC to use a global variable in a register.
register __m128i zero asm ("xmm15") = _mm_set1_epi32(1);
This produces
movdqa xmm2, xmm15
psubw xmm2, xmm0
movdqa xmm0, xmm2
Since you use vectors for efficiency, your code has a problem.
A static variable that isn't initialised with a constant will be initialised at runtime. In a thread safe way. The first time your inline function is called, the static variable is initialised. On every single call after that, a check is made whether the static variable needs initialising or not.
So on every call, there is a check, then there is a load from memory. If you don't use a static variable, there's probably a single instruction creating the value, plus plenty of opportunity for optimisation. Loading from memory is slow.
And you can have as many static variables as you like. The compiler will handle anything you throw at it.
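To make that concrete, a function-local static with a non-constant initializer behaves roughly like the hand-written sketch below (the real thing uses compiler-generated guard symbols and thread-safe synchronisation; the names here are made up for illustration):
#include <emmintrin.h>

static bool zero_ready;          // stand-in for the compiler's guard variable
static __m128i zero_storage;     // the static __m128i itself, in static storage

static inline __m128i negate_expanded(__m128i a)
{
    if (!zero_ready)                         // checked on every call
    {
        zero_storage = _mm_setzero_si128();  // runs only on the first call
        zero_ready = true;
    }
    return _mm_sub_epi16(zero_storage, a);   // the zero is reloaded from memory on every call
}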
I think I can add an interesting point to the discussion, particularly to my comment on _mm_abs_ps(). If I define
static inline __m128 _mm_abs_ps_2(__m128 x) {
__m128 signMask = _mm_set1_ps(-0.0F);
return _mm_andnot_ps(signMask, x);
}
(Agner Fog's VCL http://www.agner.org/optimize/#vectorclass uses an integer set1, a cast, and an AND operation instead, but that should in effect be the same) and use the function in a loop
float *p = data;
for (int i = 0; i < LEN; i += 4, p += 4)
_mm_store_ps(p, _mm_abs_ps_2(_mm_load_ps(p)));
then gcc (4.6.3, -O3) is clever enough to avoid repeatedly executing _mm_set1_ps by moving it outside the loop:
vmovaps xmm1, XMMWORD PTR .LC1[rip] # tmp108,
mov rax, rsp # p,
.L3:
vandnps xmm0, xmm1, XMMWORD PTR [rax] # tmp102, tmp108, MEM[base: p_54, offset: 0B]
vmovaps XMMWORD PTR [rax], xmm0 # MEM[base: p_54, offset: 0B], tmp102
add rax, 16 # p,
cmp rax, rbp # p, D.7371
jne .L3 #,
.LC1:
.long 2147483648
.long 2147483648
.long 2147483648
.long 2147483648
So, probably in most cases one shouldn't worry at all about repeatedly setting some xmm register to a constant inside some function.
I'm now working on a small optimisation of a basic dot product function, using SSE instructions in Visual Studio.
Here is my code (function call convention is cdecl):
float SSEDP4(const vect & vec1, const vect & vec2)
{
__asm
{
// get addresses
mov ecx, dword ptr[vec1]
mov edx, dword ptr[vec2]
// get the first vector
movups xmm1, xmmword ptr[ecx]
// get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
movups xmm1, xmmword ptr[edx]
// OP by OP multiply with second vector (by address)
mulps xmm1, xmm2
// add everything with horizontal add func (SSE3)
haddps xmm1, xmm1
// is one addition enough ?
// try to extract, we'll see
pextrd eax, xmm1, 03h
}
}
vect is a simple struct that contains 4 single-precision floats, not aligned to 16 bytes (which is why I use movups and not movaps).
vec1 is initialized with (1.0, 1.2, 1.4, 1.0) and vec2 with (2.0, 1.8, 1.6, 1.0)
Everything compiles well, but at execution I get 0 in both XMM registers, and therefore 0 as the result.
While debugging, Visual Studio shows me two registers (MMX1 and MMX2, or sometimes MMX2 and MMX3), which are 64-bit registers, but no XMM registers, and everything is 0.
Does someone have an idea of what's happening?
Thank you in advance :)
There are a couple of ways to get at SSE instructions on MSVC++:
Compiler Intrinsics -> http://msdn.microsoft.com/en-us/library/t467de55.aspx
External MASM file.
Inline assembly (as in your example code) is no longer a reasonable option because it will not compile when building for anything other than 32-bit x86 targets (e.g. building a 64-bit binary will fail).
Moreover, assembly blocks inhibit most optimizations. This is bad for you because even simple things like inlining won't happen for your function. Intrinsics work in a manner that does not defeat optimizers.
You compiled and ran correctly, so you are at least able to use SSE.
In order to view SSE registers in the Registers window, right click on the Registers window and select SSE. That should let you see the XMM registers.
You can also use #xmm<register><component> (e.g., #xmm00 to view xmm0[0]) in the watch window to look at individual components of the XMM registers.
Now, as for your actual problem, you are overwriting xmm1 with [edx] instead of stuffing that into xmm2.
Also, scalar floating point values are returned on the x87 stack in st(0). Instead of trying to remember how to do that, I simply store the result in a stack variable and let the compiler do it for me:
float SSEDP4(const vect & vec1, const vect & vec2)
{
float result;
__asm
{
// get addresses
mov ecx, dword ptr[vec1]
mov edx, dword ptr[vec2]
// get the first vector
movups xmm1, xmmword ptr[ecx]
// get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
movups xmm2, xmmword ptr[edx] // xmm2, not xmm1
// OP by OP multiply with second vector (by address)
mulps xmm1, xmm2
// add everything with horizontal add func (SSE3)
haddps xmm1, xmm1
// is one addition enough ?
// try to extract, we'll see
pextrd [result], xmm1, 03h
}
return result;
}
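For comparison, a hedged sketch of the same dot product written with intrinsics, assuming vect is four contiguous floats. Note that two haddps are needed to sum all four products, which also answers the "is one addition enough?" comment:
#include <pmmintrin.h>   // SSE3, for _mm_hadd_ps

float SSEDP4_intrin(const vect & vec1, const vect & vec2)
{
    __m128 a = _mm_loadu_ps(reinterpret_cast<const float*>(&vec1));  // unaligned loads, like movups
    __m128 b = _mm_loadu_ps(reinterpret_cast<const float*>(&vec2));
    __m128 p = _mm_mul_ps(a, b);
    p = _mm_hadd_ps(p, p);        // [a0+a1, a2+a3, a0+a1, a2+a3]
    p = _mm_hadd_ps(p, p);        // every lane now holds the full sum
    return _mm_cvtss_f32(p);      // return the low lane; the compiler handles the calling convention
}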