MOVAPS accesses unaligned address - c++

For some reason, one of my functions executes an SSE movaps instruction with an unaligned address, which causes a crash. It happens on the first line of the function; the rest of the function needs to be there for the crash to happen, but is omitted for clarity.
Vec3f CrashFoo(
    const Vec3f &aVec3,
    const float aFloat,
    const Vec2f &aVec2)
{
    const Vec3f vecNew =
        Normalize(Vec3f(aVec3.x, aVec3.x, std::max(aVec3.x, 0.0f)));
    // ...
}
This is how I call it from the debugging main:
int32_t main(int32_t argc, const char *argv[])
{
    Vec3f vec3{ 0.00628005248f, -0.999814332f, 0.0182171166f };
    Vec2f vec2{ 0.947231591f, 0.0522233732f };
    float floatVal{ 0.010f };
    Vec3f vecResult = CrashFoo(vec3, floatVal, vec2);
    return (int32_t)vecResult.x;
}
This is the disassembly from the beginning of the CrashFoo function to the line where it crashes:
00007FF7A7DC34F0 mov rax,rsp
00007FF7A7DC34F3 mov qword ptr [rax+10h],rbx
00007FF7A7DC34F7 push rdi
00007FF7A7DC34F8 sub rsp,80h
00007FF7A7DC34FF movaps xmmword ptr [rax-18h],xmm6
00007FF7A7DC3503 movss xmm6,dword ptr [rdx]
00007FF7A7DC3507 movaps xmmword ptr [rax-28h],xmm7
00007FF7A7DC350B mov dword ptr [rax+18h],0
00007FF7A7DC3512 mov rdi,r9
00007FF7A7DC3515 mov rbx,rcx
00007FF7A7DC3518 movaps xmmword ptr [rax-38h],xmm8
00007FF7A7DC351D movaps xmmword ptr [rax-48h],xmm9
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
00007FF7A7DC352B xorps xmm8,xmm8
00007FF7A7DC352F comiss xmm8,xmm6
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
My understanding is that it first does the usual function-call bookkeeping and then prepares the playground by saving the current contents of some SSE registers (xmm6-xmm11) onto the stack so that they are free for the subsequent code to use. The xmm* registers are stored one after another at addresses from [rax-18h] down to [rax-68h], which are nicely aligned to 16 bytes since rax=0xe4d987f788, but before xmm11 gets stored, rax is increased by 18h, which breaks the alignment and causes the crash. The xorps and comiss lines are where the actual code starts (std::max's comparison with 0). When I remove std::max it works nicely.
Do you see any reason for this behaviour?
Additional info
I uploaded a small compilable example that crashes for me in Visual Studio, but not on IDEone.
The code is compiled with Visual Studio 2013 Update 5 (x64 release, v120). I've set the project's "Struct Member Alignment" setting to 16 bytes, but with little effect, and there are no packing pragmas in the structures that I use. The error message is:
First-chance exception at 0x00007ff7a7dc3533 in PG3Render.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.

gcc and clang are both fine, and generate non-crashing, non-vectorized code for your example. (Of course, I'm compiling for the Linux SysV ABI, where none of the vector regs are call-preserved, so they weren't generating code to save xmm{6..15} on the stack in the first place.)
Your IDEone link doesn't demonstrate a crash either, so IDK. There are online compile & run sites that have MSVC as an option. You can even get asm out of them if your program uses system() to run a disassembler on itself. :P
The asm output you posted is guaranteed to crash, for any possible value of rax:
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
...
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
Accounting for the LEA, the second store address is [init_rax-50h], which is only 8 bytes away from the earlier store to [init_rax-58h], so one or the other is guaranteed to be misaligned and will fault. This appears to be a compiler bug that you should report.
I have no idea why your compiler would use lea instead of add rax, 18h. It does it right before clobbering the flags with comiss anyway, so there was no flag-preservation reason to prefer lea.
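As a hedged experiment rather than a real fix (the underlying issue still looks like a code-generation bug worth reporting): since the crash disappears when std::max is removed, spelling the clamp out by hand may sidestep the bad prologue. Vec3f and Normalize below are the question's own types, used as-is.

const float xClamped = (aVec3.x > 0.0f) ? aVec3.x : 0.0f;  // hand-written std::max(aVec3.x, 0.0f)
const Vec3f vecNew   = Normalize(Vec3f(aVec3.x, aVec3.x, xClamped));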

Related

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>
int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?
p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the /volatileMetadata flag, which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run under emulation. It'll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.
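For reference, the switch just goes on the ordinary compile command line, e.g. cl /O2 /volatileMetadata- atomic_load.cpp (file name made up here), or into the project's additional compiler options in Visual Studio.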

How to make msvc vectorize float addition?

I have this code:
#include <array>
#include <cstddef>

constexpr size_t S = 4;

void add(std::array<float, S>& a, std::array<float, S> b)
{
    for (size_t i = 0; i < S; ++i)
        a[i] += b[i];
}
Both clang and gcc realize that instead of doing 4 scalar additions they can do one packed addition, using the addps instruction. E.g. clang generates this:
movups xmm2, xmmword ptr [rdi]
movlhps xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
addps xmm0, xmm2
movups xmmword ptr [rdi], xmm0
ret
As you can see on Godbolt, gcc is a bit behind clang, as it needs more moves. But that's fine. My problem is MSVC, which is far worse, as you can see:
mov eax, DWORD PTR _a$[esp-4]
movups xmm2, XMMWORD PTR _b$[esp-4]
movss xmm1, DWORD PTR [eax+4]
movaps xmm0, xmm2
addss xmm0, DWORD PTR [eax]
movss DWORD PTR [eax], xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 85 ; 00000055H
addss xmm1, xmm0
movaps xmm0, xmm2
shufps xmm0, xmm2, 170 ; 000000aaH
shufps xmm2, xmm2, 255 ; 000000ffH
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
addss xmm1, xmm0
movss xmm0, DWORD PTR [eax+12]
addss xmm0, xmm2
movss DWORD PTR [eax+8], xmm1
movss DWORD PTR [eax+12], xmm0
ret 0
I tried different optimization levels, but /O2 seems to be the best. I also tried manually unrolling the loop, but that changed nothing for MSVC.
So, is there a way to make MSVC do the same optimization, using one addps instead of four addss? Or is there maybe a good reason why MSVC doesn't do it?
Edit
By adding the /Qvec-report:2 flag as suggested by Shawn in the comments (thanks!), I found out that MSVC thinks the loop is too small to be worth vectorizing. Clang and gcc are of a different opinion, but OK.
And indeed, if I change S to 16, MSVC comes up with a vectorized version, even though it still emits a non-vectorized fallback branch (completely unnecessary in my opinion, as S is known at compile time). In general, MSVC's code looks like a mess compared to gcc and clang, see here.
I have tested the code you posted in Microsoft Visual Studio 2017 and it works for me. When I call your function add with aligned, non-aliased parameters, it compiles to the addps instruction, not addss. Maybe you are using an older version of Visual Studio?
However, I was able to reproduce your problem by deliberately giving the function non-aligned or aliased parameters. To do this, I replaced the function parameters with C-style array pointers (because I don't know exactly how std::array is implemented) and deliberately called the function with aliased pointers, by making the two arrays overlap. In that case, the generated code uses addss four times instead of addps once. Deliberately passing an unaligned pointer had the same effect.
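A reconstruction of that experiment (hypothetical code, not the exact snippet used above) looks roughly like this:

void add_ptr(float *a, const float *b)
{
    for (int i = 0; i < 4; ++i)
        a[i] += b[i];   // with possibly-overlapping pointers, this is where MSVC falls back to scalar addss
}

// float buf[8] = { /* ... */ };
// add_ptr(buf + 1, buf);   // deliberately aliased call: the two ranges overlap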
This behavior also makes sense. For vectorization to be meaningful, the compiler must be sure that the arrays do not overlap and that they are properly aligned. I believe alignment is less of an issue with AVX than with SSE.
Of course, the compiler must be able to determine whether there are possible aliasing or alignment issues at compile time, not at run time. Therefore, maybe the problem is that you are calling the function in such a way that the compiler can't be sure at compile time whether the parameters are aliased and whether they are aligned. Compilers are sometimes not very smart at determining these things. However, as you have pointed out in the comments section, since you are passing one parameter by value, the compiler should be able to determine that there is no danger of overlap. Therefore, my guess is that it is an alignment issue, as the compiler is unsure at compile time how the contents of std::array are aligned. As I am unable to reproduce your problem using std::array, you may want to post the code that calls the function.
You can also enforce vectorization by explicitly calling the corresponding compiler intrinsic _mm_add_ps for the instruction addps.
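If it comes to that, a minimal intrinsics sketch (my own, not from the question) could look like the following; it assumes S == 4, makes no alignment assumption (hence the unaligned load/store), and relies only on std::array storing its elements contiguously:

#include <array>
#include <xmmintrin.h>   // _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

void add_sse(std::array<float, 4>& a, const std::array<float, 4>& b)
{
    __m128 va = _mm_loadu_ps(a.data());            // load all four floats of a
    __m128 vb = _mm_loadu_ps(b.data());            // load all four floats of b
    _mm_storeu_ps(a.data(), _mm_add_ps(va, vb));   // one packed addition, one store
}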

C++ performance std::array vs std::vector

Good evening.
I know that C-style arrays and std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which the use of std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The code above, which initializes the variables, is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
    return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because a vector requires its data to be stored in allocated memory (which with the default allocator means dynamic memory), the values that need to be used have to live in heap memory and be accessed repeatedly while this program runs. Conversely, the values used by the arrays can be optimized out entirely and simply referenced directly in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: in the vector version the numbers are added through a less efficient instruction sequence, whereas the array version adds values that are already sitting in registers. The vector version also involves more memory loads than the array version. These factors in combination are going to result in code that executes faster for the std::array version of the code than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];  // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob; // actually hurts ICC it seems?
    // #define g glob                 // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
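If a portable source-level workaround is wanted, one sketch (mine, not part of the answer above) is to hoist the sums into locals by hand. This is only equivalent if glob really doesn't overlap v1/v2, which is the case here since each vector owns its own allocation, and it gives the compiler plain doubles that obviously cannot alias the destination:

void assemble_vec_hoisted()
{
    const double c0 = comb(v1[0], v2[0]);   // computed once, kept in registers
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}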
I think the point is that you use too small a storage size (six doubles), which allows the compiler, in the std::array case, to eliminate storing in RAM entirely by placing the values in registers. The compiler can keep stack variables in registers if that is more optimal. This halves the number of memory accesses (only the writes to glob remain). In the case of std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try significantly larger sizes for a1, a2, v1, v2.

How can I use SSE (and SSE2, SSE3, etc.) extensions when building with Visual C++?

I'm now working on a small optimisation of a basic dot product function, using SSE instructions in Visual Studio.
Here is my code (the calling convention is cdecl):
float SSEDP4(const vect & vec1, const vect & vec2)
{
    __asm
    {
        // get addresses
        mov ecx, dword ptr[vec1]
        mov edx, dword ptr[vec2]
        // get the first vector
        movups xmm1, xmmword ptr[ecx]
        // get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
        movups xmm1, xmmword ptr[edx]
        // OP by OP multiply with second vector (by address)
        mulps xmm1, xmm2
        // add everything with horizontal add func (SSE3)
        haddps xmm1, xmm1
        // is one addition enough ?
        // try to extract, we'll see
        pextrd eax, xmm1, 03h
    }
}
vect is a simple struct that contains 4 single-precision floats, not aligned to 16 bytes (that is why I use movups and not movaps).
vec1 is initialized with (1.0, 1.2, 1.4, 1.0) and vec2 with (2.0, 1.8, 1.6, 1.0)
Everything compiles fine, but at execution I get 0 in both XMM registers, and therefore 0 as the result.
While debugging, Visual Studio shows me 2 registers (MMX1 and MMX2, or sometimes MMX2 and MMX3), which are 64-bit registers, but no XMM registers, and everything is 0.
Does someone have an idea of what's happening?
Thank you in advance :)
There are a couple of ways to get at SSE instructions on MSVC++:
Compiler Intrinsics -> http://msdn.microsoft.com/en-us/library/t467de55.aspx
External MASM file.
Inline assembly (as in your example code) is no longer a reasonable option because it will not compile when building for anything other than 32-bit x86 (e.g. building a 64-bit binary will fail).
Moreover, assembly blocks inhibit most optimizations. This is bad for you because even simple things like inlining won't happen for your function. Intrinsics work in a manner that does not defeat optimizers.
You compiled and ran correctly, so you are at least able to use SSE.
In order to view SSE registers in the Registers window, right click on the Registers window and select SSE. That should let you see the XMM registers.
You can also use #xmm<register><component> (e.g., #xmm00 to view xmm0[0]) in the watch window to look at individual components of the XMM registers.
Now, as for your actual problem, you are overwriting xmm1 with [edx] instead of stuffing that into xmm2.
Also, scalar floating point values are returned on the x87 stack in st(0). Instead of trying to remember how to do that, I simply store the result in a stack variable and let the compiler do it for me:
float SSEDP4(const vect & vec1, const vect & vec2)
{
    float result;
    __asm
    {
        // get addresses
        mov ecx, dword ptr[vec1]
        mov edx, dword ptr[vec2]
        // get the first vector
        movups xmm1, xmmword ptr[ecx]
        // get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
        movups xmm2, xmmword ptr[edx] // xmm2, not xmm1
        // OP by OP multiply with second vector (by address)
        mulps xmm1, xmm2
        // add everything with horizontal add func (SSE3)
        haddps xmm1, xmm1
        // is one addition enough ?
        // try to extract, we'll see
        pextrd [result], xmm1, 03h
    }
    return result;
}
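For comparison, here is a hedged intrinsics sketch of the same SSE3 dot product (it assumes vect exposes its four floats contiguously starting at a member x, which is a guess about the struct's layout); unlike the __asm version, it also builds for x64 and stays visible to the optimizer:

#include <pmmintrin.h>   // SSE3: _mm_hadd_ps (pulls in the SSE/SSE2 headers too)

float SSEDP4_intrin(const vect & vec1, const vect & vec2)
{
    __m128 a = _mm_loadu_ps(&vec1.x);   // unaligned loads, matching the movups above
    __m128 b = _mm_loadu_ps(&vec2.x);
    __m128 p = _mm_mul_ps(a, b);
    p = _mm_hadd_ps(p, p);              // two horizontal adds reduce all four products...
    p = _mm_hadd_ps(p, p);              // ...so one haddps alone would not have been enough
    return _mm_cvtss_f32(p);            // the sum is in the low element
}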

Why does my data not seem to be aligned?

I'm trying to figure out how to best pre-calculate some sin and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
    m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C++ expression. In an inline-assembly (__asm) instruction, however, it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment fault, and would give you incorrect results in the odd case where t's address happens to be aligned.
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <malloc.h>      // for _aligned_malloc
#include <tchar.h>
#include <xmmintrin.h>

int main()
{
    static __m128 *m_sincos;
    int Bins = 4;
    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        __asm movaps xmm0, m_sincos[t];
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }
    return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
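As a small sketch of that lesson (hypothetical follow-up code, not from the question): staying with intrinsics end to end means no __asm block and no manual addressing at all, and the plain element access is safe because m_sincos came from _aligned_malloc(..., 16):

__m128 lookup_scaled(const __m128 *m_sincos, int t)
{
    __m128 v = m_sincos[t];                   // plain element load; the storage is 16-byte aligned
    return _mm_mul_ps(v, _mm_set1_ps(2.0f));  // arbitrary example operation on the loaded values
}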
You should always use the intrinsics, or even just turn on the compiler's SSE code generation and leave it to the compiler, rather than coding it explicitly in assembly. This is because __asm is not portable to 64-bit code.