I need to support dynamic libraries and static linking of object files for 32-bit platforms (x86): Win32, Linux32, and MacOS32. The problem occurs when passing FPU arguments (float and double). By default, they are passed in SSE registers, not the stack. I am not against SSE, but I need the arguments and the result to be passed in the standard way: on the stack and through the x87 FPU.
I tried (on Godbolt) setting the -mno-sse option, and this produces the desired result. But I would not want to abandon SSE completely; I would sometimes like to use intrinsics and/or MMX/SSE optimizations.
__attribute__((stdcall))
long double test(int* num, float f, double d)
{
    *num = sizeof(long double);
    return f * d;
}
/*-target i386-windows-gnu -c -O3*/
push ebp
mov ebp, esp
and esp, -8
sub esp, 8
movss xmm0, dword ptr [ebp + 12] # xmm0 = mem[0],zero,zero,zero
mov eax, dword ptr [ebp + 8]
cvtss2sd xmm0, xmm0
mov dword ptr [eax], 12
mulsd xmm0, qword ptr [ebp + 16]
movsd qword ptr [esp], xmm0
fld qword ptr [esp]
mov esp, ebp
pop ebp
ret 16
/*-target i386-windows-gnu -mno-sse -c -O3*/
mov eax, dword ptr [esp + 4]
mov dword ptr [eax], 12
fld dword ptr [esp + 8]
fmul qword ptr [esp + 12]
ret 16
Both versions of your function are using the same calling convention.
By default, they are passed in SSE registers, not the stack.
That's not what your asm output shows, and not what happens. Notice that your first function loads its dword float arg from the stack into xmm0, then uses mulsd with the qword double arg, also from the stack. movss xmm0, dword ptr [ebp + 12] is a load that destroys the old contents of XMM0; XMM0 is not an input to this function.
Then, to return the retval in x87 st0 as per the crusty old 32-bit calling convention you're using, it uses a movsd store to the stack and an fld x87 load.
The * operator promotes the float to double to match the other operand, resulting in a double multiply, not long double. Promotion from double to long double doesn't happen until that temporary double result is returned.
It looks like clang defaults to what gcc would call -mfpmath=sse if available. This is normally good, except for small functions where the x87 return-value calling convention gets in the way. (Also note that x87 has "free" promotion from float and double to long double, as part of how fld dword and qword work.) Clang isn't checking how much overhead it's going to cost to use SSE math in a small function; here it would obviously have been more efficient to use x87 for one multiply.
But anyway, -mno-sse is not changing the ABI; read your asm more carefully. If it were, the generated asm would suck less!
On Windows, if you're stuck making 32-bit code at all, vectorcall should be a better way to pass/return FP vars when possible: it can use XMM registers to pass/return. Obviously any ABIs that are set in stone (like for existing libraries) need to be declared correctly so the compiler calls them / receives return values from them correctly.
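For illustration, a minimal sketch of such a declaration (clang attribute spelling, as in your example; MSVC writes the keyword as __vectorcall; the function name is made up):
__attribute__((vectorcall))
float mul_vc(float f, double d)
{
    // vectorcall passes f and d in XMM registers and returns the result
    // in XMM0, instead of the stack + st(0) as with stdcall.
    return (float)(f * d);
}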
What you currently have is stdcall with FP args on the stack and returned in st0.
BTW, a lot of the code in your first function comes from clang aligning the stack to spill/reload the temporary double; the Windows ABI only guarantees 4-byte stack alignment. This amount of work to avoid the risk of a cache-line split is almost certainly not worth it, especially when it could have just destroyed its double d stack arg as scratch space and hoped the caller had aligned it. Optimization is enabled; it's just setting up a frame pointer so it can and esp without losing the old ESP.
You could use return f * (long double)d;
That compiles to identical asm to the -mno-sse version. https://godbolt.org/z/LK0s_5
SSE2 doesn't support 80-bit x87 types, so clang is forced to use fmul. It ends up not messing around at all with SSE, and then the result is where it needs to be for the return value.
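For reference, here is the whole function with that one change; per the Godbolt link above, it compiles to the same asm as the -mno-sse version:
__attribute__((stdcall))
long double test(int* num, float f, double d)
{
    *num = sizeof(long double);
    return f * (long double)d; // promote first, forcing the multiply onto the x87 FPU
}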
I am curious why the following piece of code:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNO";
}
when compiled with -O3 yields the following code:
main: # #main
xor eax, eax
ret
(I perfectly understand that there is no need for the unused a so the compiler can entirely omit it from the generated code)
However the following program:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNOP"; // <-- !!! One Extra P
}
yields:
main: # #main
push rbx
sub rsp, 48
lea rbx, [rsp + 32]
mov qword ptr [rsp + 16], rbx
mov qword ptr [rsp + 8], 16
lea rdi, [rsp + 16]
lea rsi, [rsp + 8]
xor edx, edx
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
mov qword ptr [rsp + 16], rax
mov rcx, qword ptr [rsp + 8]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
mov qword ptr [rsp + 24], rcx
mov rax, qword ptr [rsp + 16]
mov byte ptr [rax + rcx], 0
mov rdi, qword ptr [rsp + 16]
cmp rdi, rbx
je .LBB0_3
call operator delete(void*)
.LBB0_3:
xor eax, eax
add rsp, 48
pop rbx
ret
mov rdi, rax
call _Unwind_Resume
.L.str:
.asciz "ABCDEFGHIJKLMNOP"
when compiled with the same -O3. I don't understand why it does not recognize that a is still unused, even though the string is one byte longer.
This question is relevant to gcc 9.1 and clang 8.0 (online: https://gcc.godbolt.org/z/p1Z8Ns), because other compilers in my observation either entirely drop the unused variable (ellcc) or generate code for it regardless of the length of the string.
This is due to the small string optimization. When the string data is less than or equal to 16 characters, including the null terminator, it is stored in a buffer local to the std::string object itself. Otherwise, it allocates memory on the heap and stores the data over there.
The first string "ABCDEFGHIJKLMNO" plus the null terminator is exactly of size 16. Adding "P" makes it exceed the buffer, hence new is called internally, inevitably leading to a system call. The compiler can only optimize something away if it can ensure that there are no side effects. A system call probably makes it impossible to do this; by contrast, changing a buffer local to the object under construction allows for such a side-effect analysis.
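You can see the threshold at runtime by querying capacity(); a minimal sketch (the inline capacity is implementation-defined; libstdc++ reports 15, libc++ reports 22):
#include <iostream>
#include <string>

int main()
{
    std::string s;
    // A default-constructed string reports the largest size that still
    // fits in the object's local buffer without touching the heap.
    std::cout << s.capacity() << '\n'; // prints 15 with libstdc++ 9.1
}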
Tracing the local buffer in libstdc++, version 9.1, reveals these parts of bits/basic_string.h:
template<typename _CharT, typename _Traits, typename _Alloc>
class basic_string
{
    // ...
    enum { _S_local_capacity = 15 / sizeof(_CharT) };
    union
    {
        _CharT _M_local_buf[_S_local_capacity + 1];
        size_type _M_allocated_capacity;
    };
    // ...
};
which lets you spot the local buffer size _S_local_capacity and the local buffer itself (_M_local_buf). When the constructor triggers basic_string::_M_construct being called, you have in bits/basic_string.tcc:
void _M_construct(_InIterator __beg, _InIterator __end, ...)
{
    size_type __len = 0;
    size_type __capacity = size_type(_S_local_capacity);
    while (__beg != __end && __len < __capacity)
    {
        _M_data()[__len++] = *__beg;
        ++__beg;
    }
where the local buffer is filled with its content. Right after this part, we get to the branch where the local capacity is exhausted: new storage is allocated (through the allocator in _M_create), the local buffer is copied into the new storage, which is then filled with the rest of the initializing argument:
while (__beg != __end)
{
    if (__len == __capacity)
    {
        // Allocate more space.
        __capacity = __len + 1;
        pointer __another = _M_create(__capacity, __len);
        this->_S_copy(__another, _M_data(), __len);
        _M_dispose();
        _M_data(__another);
        _M_capacity(__capacity);
    }
    _M_data()[__len++] = *__beg;
    ++__beg;
}
As a side note, small string optimization is quite a topic on its own. To get a feeling for how tweaking individual bits can make a difference at large scale, I'd recommend this talk. It also mentions how the std::string implementation that ships with gcc (libstdc++) works and changed during the past to match newer versions of the standard.
I was surprised the compiler saw through a std::string constructor/destructor pair until I saw your second example. It didn't. What you're seeing here is small string optimization and corresponding optimizations from the compiler around that.
Small string optimization is when the std::string object itself is big enough to hold the contents of the string, plus a size and possibly a discriminating bit that indicates whether the string is operating in small or big string mode. In such a case, no dynamic allocations occur and the string is stored in the std::string object itself.
Compilers are really bad at eliding unneeded allocations and deallocations; they are treated almost as if they had side effects and are thus impossible to elide. When you go over the small string optimization threshold, dynamic allocations occur, and the result is what you see.
As an example
void foo() {
    delete new int;
}
is the simplest, dumbest allocation/deallocation pair possible, yet gcc emits this assembly even under -O3:
sub rsp, 8
mov edi, 4
call operator new(unsigned long)
mov esi, 4
add rsp, 8
mov rdi, rax
jmp operator delete(void*, unsigned long)
While the accepted answer is valid, since C++14 it's actually the case that new and delete calls can be optimized away. See this arcane wording on cppreference:
New-expressions are allowed to elide ... allocations made through replaceable allocation functions. In case of elision, the storage may be provided by the compiler without making the call to an allocation function (this also permits optimizing out unused new-expression).
...
Note that this optimization is only permitted when new-expressions are used, not any other methods to call a replaceable allocation function: delete[] new int[10]; can be optimized out, but operator delete(operator new(10)); cannot.
This actually allows compilers to completely drop your local std::string even if it's very long. In fact, clang++ with libc++ already does this (GodBolt), since libc++ uses the __builtin_operator_new and __builtin_operator_delete built-ins in its implementation of std::string - that's "storage provided by the compiler". Thus, we get:
main():
xor eax, eax
ret
with basically any-length unused string.
GCC doesn't do this yet, but I've recently opened bug reports about it; see this SO answer for links.
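For reference, a minimal pair to experiment with yourself (a sketch; the function name is made up): clang at -O2 folds the whole function to a constant, while gcc still emits the new/delete calls:
int observable()
{
    int* p = new int(42); // C++14 allows eliding this new-expression...
    int v = *p;
    delete p;             // ...together with its matching delete
    return v;             // clang -O2 compiles this to just "return 42"
}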
Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which the use of std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
    return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}
void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version the numbers are added together through less efficient opcodes, compared to the array version, which uses (more) SSE instructions. The vector version also involves more memory lookups. These factors in combination are going to result in code that executes faster for the std::array version of the code than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
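For example, this perfectly legal snippet would break if the library promised the compiler that nothing else points into the vector's storage (a sketch):
std::vector<double> v(3, 0.0);
double* p = &v[1]; // taking the address of an element is allowed
*p = 42.0;         // and this write must be visible through v[1] afterwards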
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
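As a tiny illustration of that rule, detached from the library code (a sketch; the names are made up):
void tbaa_demo(double* d, double** pp)
{
    (*pp)[0] = 1.0; // loads the pointer *pp
    *d = 2.0;       // storing a double can't modify a double* object...
    (*pp)[1] = 3.0; // ...so the compiler may reuse the cached *pp here
}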
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];  // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob; // actually hurts ICC it seems?
    // #define g glob                 // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
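If you control the source, a sketch of the manual fix is to hoist the loop-invariant sums into locals yourself (reusing the question's globals); the local copies can't alias the stores to glob, so they stay in registers:
void assemble_vec_hoisted()
{
    const double c0 = comb(v1[0], v2[0]); // read v1/v2 once, before the loop
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}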
I think the point is that you use a very small storage size (six doubles), which allows the compiler, in the std::array case, to eliminate the RAM storage entirely by placing the values in registers. The compiler can store stack variables in registers when that is more optimal. This halves the memory accesses (only the writes to glob remain). In the case of a std::vector, the compiler cannot perform such an optimization, since dynamic memory is used. Try using significantly larger sizes for a1, a2, v1, v2.
For some reason one of my functions is calling an SSE instruction, movaps, with an unaligned parameter, which causes a crash. It happens on the first line of the function; the rest needs to be there just for the crash to happen, but is omitted for clarity.
Vec3f CrashFoo(
const Vec3f &aVec3,
const float aFloat,
const Vec2f &aVec2)
{
    const Vec3f vecNew =
        Normalize(Vec3f(aVec3.x, aVec3.x, std::max(aVec3.x, 0.0f)));
    // ...
}
This is how I call it from the debugging main:
int32_t main(int32_t argc, const char *argv[])
{
    Vec3f vec3{ 0.00628005248f, -0.999814332f, 0.0182171166f };
    Vec2f vec2{ 0.947231591f, 0.0522233732f };
    float floatVal{ 0.010f };
    Vec3f vecResult = CrashFoo(vec3, floatVal, vec2);
    return (int32_t)vecResult.x;
}
This is the disassembly from the beginning of the CrashFoo function to the line where it crashes:
00007FF7A7DC34F0 mov rax,rsp
00007FF7A7DC34F3 mov qword ptr [rax+10h],rbx
00007FF7A7DC34F7 push rdi
00007FF7A7DC34F8 sub rsp,80h
00007FF7A7DC34FF movaps xmmword ptr [rax-18h],xmm6
00007FF7A7DC3503 movss xmm6,dword ptr [rdx]
00007FF7A7DC3507 movaps xmmword ptr [rax-28h],xmm7
00007FF7A7DC350B mov dword ptr [rax+18h],0
00007FF7A7DC3512 mov rdi,r9
00007FF7A7DC3515 mov rbx,rcx
00007FF7A7DC3518 movaps xmmword ptr [rax-38h],xmm8
00007FF7A7DC351D movaps xmmword ptr [rax-48h],xmm9
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
00007FF7A7DC352B xorps xmm8,xmm8
00007FF7A7DC352F comiss xmm8,xmm6
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
My understanding is that it first does the usual function call stuff and then starts preparing the playground by saving the current contents of some SSE registers (xmm6-xmm11) onto the stack so that they are free to be used by the subsequent code. The xmm* registers are stored one after another at addresses from [rax-18h] to [rax-68h], which are nicely aligned to 16 bytes since rax=0xe4d987f788, but before the xmm11 register gets stored, rax is increased by 18h, which breaks the alignment and causes the crash. The xorps and comiss lines are where the actual code starts (std::max's comparison with 0). When I remove std::max, it works nicely.
Do you see any reason for this behaviour?
Additional info
I uploaded a small compilable example that crashes for me in Visual Studio, but not on IDEone.
The code is compiled in Visual Studio 2013 Update 5 (x64 release, v120). I've set the "Struct Member Alignment" setting of the project to 16 bytes, but with little improvement, and there are no packing pragmas in the structures that I use. The error message is:
First-chance exception at 0x00007ff7a7dc3533 in PG3Render.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.
gcc and clang are both fine, and make non-crashing non-vectorized code for your example. (Of course, I'm compiling for the Linux SysV ABI, where none of the vector regs are call-preserved, so they weren't generating code to save xmm{6..15} on the stack in the first place.)
Your IDEone link doesn't demonstrate a crash either, so IDK. I think there are online compile & run sites that have MSVC as an option. You can even get asm out of them if your program uses system to run a disassembler on itself. :P
The asm output you posted is guaranteed to crash, for any possible value of rax:
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
...
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
Accounting for the LEA, the second store address is [init_rax-50h], which is only 8B offset from the earlier stores. One or the other will fault. This appears to be a compiler bug that you should report.
I have no idea why your compiler would use lea instead of add rax, 18h; it does it right before clobbering the flags with a comiss anyway.
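Until that's fixed, a hypothetical workaround sketch, based only on your observation that the crash disappears without std::max in that expression:
// Hypothetical workaround: spell out the clamp instead of calling std::max,
// which you observed avoids the miscompiled prologue.
const float xClamped = (aVec3.x > 0.0f) ? aVec3.x : 0.0f;
const Vec3f vecNew = Normalize(Vec3f(aVec3.x, aVec3.x, xClamped));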
I'm now working on a small optimisation of a basic dot product function, using SSE instructions in Visual Studio.
Here is my code (the function calling convention is cdecl):
float SSEDP4(const vect & vec1, const vect & vec2)
{
    __asm
    {
        // get addresses
        mov ecx, dword ptr[vec1]
        mov edx, dword ptr[vec2]
        // get the first vector
        movups xmm1, xmmword ptr[ecx]
        // get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
        movups xmm1, xmmword ptr[edx]
        // OP by OP multiply with second vector (by address)
        mulps xmm1, xmm2
        // add everything with horizontal add func (SSE3)
        haddps xmm1, xmm1
        // is one addition enough ?
        // try to extract, we'll see
        pextrd eax, xmm1, 03h
    }
}
vect is a simple struct that contains 4 single-precision floats, not aligned to 16 bytes (that is why I use movups and not movaps).
vec1 is initialized with (1.0, 1.2, 1.4, 1.0) and vec2 with (2.0, 1.8, 1.6, 1.0)
Everything compiles fine, but at execution I get 0 in both XMM registers, and thus as the result.
While debugging, Visual Studio shows me 2 registers (MMX1 and MMX2, or sometimes MMX2 and MMX3), which are 64-bit registers, but no XMM registers, and everything is 0.
Does someone have an idea of what's happening?
Thank you in advance :)
There are a couple of ways to get at SSE instructions on MSVC++:
Compiler Intrinsics -> http://msdn.microsoft.com/en-us/library/t467de55.aspx
External MASM file.
Inline assembly (as in your example code) is no longer a reasonable option because it will not compile when building for anything other than 32-bit x86 systems (e.g. building a 64-bit binary will fail).
Moreover, assembly blocks inhibit most optimizations. This is bad for you because even simple things like inlining won't happen for your function. Intrinsics work in a manner that does not defeat optimizers.
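For comparison, a sketch of the same dot product written with intrinsics (this assumes SSE3 for haddps and that vect is four contiguous floats; note that two horizontal adds are needed to sum all four lanes):
#include <pmmintrin.h> // SSE3: _mm_hadd_ps

float SSEDP4_intrin(const vect & vec1, const vect & vec2)
{
    __m128 a = _mm_loadu_ps(reinterpret_cast<const float*>(&vec1)); // like movups
    __m128 b = _mm_loadu_ps(reinterpret_cast<const float*>(&vec2));
    __m128 m = _mm_mul_ps(a, b); // element-wise multiply
    m = _mm_hadd_ps(m, m);       // pairwise sums
    m = _mm_hadd_ps(m, m);       // total in every lane
    return _mm_cvtss_f32(m);     // return the low lane
}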
You compiled and ran correctly, so you are at least able to use SSE.
In order to view SSE registers in the Registers window, right click on the Registers window and select SSE. That should let you see the XMM registers.
You can also use #xmm<register><component> (e.g., #xmm00 to view xmm0[0]) in the watch window to look at individual components of the XMM registers.
Now, as for your actual problem, you are overwriting xmm1 with [edx] instead of stuffing that into xmm2.
Also, scalar floating point values are returned on the x87 stack in st(0). Instead of trying to remember how to do that, I simply store the result in a stack variable and let the compiler do it for me:
float SSEDP4(const vect & vec1, const vect & vec2)
{
    float result;
    __asm
    {
        // get addresses
        mov ecx, dword ptr[vec1]
        mov edx, dword ptr[vec2]
        // get the first vector
        movups xmm1, xmmword ptr[ecx]
        // get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
        movups xmm2, xmmword ptr[edx] // xmm2, not xmm1
        // OP by OP multiply with second vector (by address)
        mulps xmm1, xmm2
        // add everything with horizontal add func (SSE3)
        haddps xmm1, xmm1
        // is one addition enough ?
        // try to extract, we'll see
        pextrd [result], xmm1, 03h
    }
    return result;
}
I'm trying to figure out how to best pre-calculate some sin and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
    m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C expression. In an assembly instruction, however (__asm?), it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment failure, and would give you incorrect results in the odd case where t's address happens to be aligned.
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <tchar.h>
#include <xmmintrin.h>
int main()
{
    static __m128 *m_sincos;
    int Bins = 4;
    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        __asm movaps xmm0, m_sincos[t];
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }
    return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
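For example, a sketch of the lookup done entirely with intrinsics (no __asm; every name other than m_sincos is made up):
#include <xmmintrin.h>

__m128 scaled_entry(const __m128 *m_sincos, int t, float scale)
{
    __m128 entry = _mm_load_ps((const float*)&m_sincos[t]); // aligned load, like movaps
    return _mm_mul_ps(entry, _mm_set1_ps(scale));           // scale all four lanes
}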
You should always use the intrinsics, or even just enable SSE code generation and leave it to the compiler, rather than explicitly coding it in. This is because __asm is not portable to 64-bit code.