Can a movss instruction be used to replace integer data?

With the constraint that I can use only SSE and SSE2 instructions, I have a need to replace the least significant (0) element of a 4 element __m128i vector with the 0 element from another vector.
For floating point vectors, the task is simple - one can use the _mm_move_ss() intrinsic to cause the element to be replaced with the 0 element from another vector. It generates one movss instruction, so is quite efficient.
Using two casting intrinsics, it's possible to also convince the compiler to use a single SSE movss instruction to move integer data. The source code ends up looking like this:
__m128i NewVector = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(Take3FromThisVector),
_mm_castsi128_ps(Take1FromThisVector)));
It looks a bit messy, but with a suitable amount of commenting it can be acceptable, especially since it generates a minimum of instructions. In its typical use everything's optimized to be in xmm registers.
My question is this:
Since it's a movss instruction, where the "ss" implies single precision floating point, is it okay to have it move integer data that could potentially contain some "special" or "illegal" (for floating point) combo of bits in any of the vector positions?
The obvious alternative - which I also implemented and tested - is to AND the first vector with a mask, then OR in a second vector that contains just one value in the least significant element, with all the others being zero. As you can imagine, this generates more instructions.
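For comparison, here is a minimal sketch of the cast-based approach and the mask-based alternative described above. The function names and the `Element` helper are hypothetical, chosen only for illustration, and the mask version is one possible way to write the AND/OR idea rather than the exact code that was tested.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE header)
#include <cassert>
#include <cstdint>

// Cast-based approach: element 0 comes from takeLow, elements 1..3 from keepUpper.
// The casts only change the type; they are not expected to emit instructions.
inline __m128i ReplaceLowWithMovss(__m128i keepUpper, __m128i takeLow) {
    return _mm_castps_si128(
        _mm_move_ss(_mm_castsi128_ps(keepUpper), _mm_castsi128_ps(takeLow)));
}

// Mask-based alternative: clear element 0 of keepUpper, then OR in element 0 of takeLow.
inline __m128i ReplaceLowWithMask(__m128i keepUpper, __m128i takeLow) {
    const __m128i lowMask = _mm_set_epi32(0, 0, 0, -1);  // all ones in element 0 only
    return _mm_or_si128(_mm_andnot_si128(lowMask, keepUpper),
                        _mm_and_si128(lowMask, takeLow));
}

// Hypothetical helper for inspecting one 32-bit element of a vector.
inline std::uint32_t Element(__m128i v, int index) {
    assert(index >= 0 && index < 4);
    alignas(16) std::uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
    return lanes[index];
}
```

Both functions should return identical bit patterns for any input, including values whose bits would be a NaN if interpreted as a float.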
I've tested the casting approach I showed above and it doesn't seem to cause any problems, but I note in particular that there's no intrinsic provided that does this same operation for integer data. It seems as though Intel would have provided one if it was just as good for integer data - e.g., _mm_move_epi32 or similar. And so I'm skeptical whether this is a good idea.
I did some searches, e.g., "can a movss instruction cause a floating point exception", but did not find any information that would answer my question.
Thanks in advance for knowledge you're willing to share.
-Noel

Yes, it's fine to use FP shuffles like movss xmm, xmm on integer data. The insn reference manual tells you that it can't raise FP numeric exceptions; only actual FP math instructions do that. So go ahead and cast.
There isn't even a bypass delay for using FP shuffles on integer data in most uarches (but there is extra latency for using integer shuffles between FP math instructions).
Agner Fog's "optimizing assembly" guide has a great section on what instructions are useful for different kinds of data movement (broadcasts, merging, etc.) See also the x86 tag wiki for more good links.
The reason there's no integer intrinsic is that the SSE2 movd integer instruction zeros the upper bytes of the destination, like movss used as a load, but unlike movss between registers.
Intel's vector instruction set is known for its inconsistency and non-orthogonality, esp. the earliest versions (like SSE1). SSE4.1 filled many gaps, but there are still obvious missing pieces.

The types __m128 and __m128i are interchangeable. The main reason for the cast is to make your intentions clearer (and keep your compiler happy). The cast itself does not generate any extra assembly.
The _mm_move_ss operation is described directly in terms of which bits end up in your result.
If you end up with an invalid bit combination for single-precision floats, this will only be a problem if you try to use the resulting value in floating-point calculations.

Related

Fastest way to compare a double to exact 0 while both +0.0 or -0.0 are accepted

So far I have the following:
bool IsZero(const double x) {
return fabs(x) == +0.0;
}
Is this the fastest of correct ways to compare to exact 0, while both +0.0 and -0.0 are accepted?
If CPU-specific, lets consider x86-64. If compiler specific, lets consider MSVC++2017 toolset v141.

Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is +0.0. If only the sign bit is set, the value is −0.0, since the representation allows both a positive and a negative zero, as you mention in the question. The two compare equal and you want to accept both, so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. The sign bit is the most significant bit of the value, so you simply shift it out and then test the remaining bits.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32-bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));
return ((bits + bits) == 0);
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
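If you do want a standard-conforming variant to compare against, a minimal sketch using std::memcpy instead of the pointer cast could look like this. The name IsZeroBits is hypothetical, and whether a given compiler turns the memcpy into a plain register move is something you would need to check in its output.

```cpp
#include <cstdint>
#include <cstring>

// Returns true only for +0.0f and -0.0f. Assumes an IEEE-754 binary32 float.
bool IsZeroBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bits
    // Shifting left by one discards the sign bit; zero means all other bits were 0.
    return static_cast<std::uint32_t>(bits << 1) == 0;
}
```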
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way to go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just do value == 0.0. The exact sequence of code that they emit varies, depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE or CMPEQSS+MOVD+NEG (the latter is used exclusively by ICC). Your attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then a value of 0 corresponds to the case where the exponent bits are all 0. The upper bits are the sign bit and the exponent bits, so you can just test these exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In unsafe C:
const uint64_t bits = *(reinterpret_cast<const uint64_t*>(&value));
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000) >> 32;
return ((upperBits + upperBits) == 0);
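A standard-conforming sketch of the same upper-bits test, again using std::memcpy, is shown below. The name IsZeroUpperBits is hypothetical. Note the caveat from the text: values whose upper 32 bits (ignoring the sign) are all zero, which includes the smallest denormals, are also reported as zero.

```cpp
#include <cstdint>
#include <cstring>

// Tests only the sign, exponent and top mantissa bits of an IEEE-754 binary64 double.
// Only valid when denormal inputs can be ruled out.
bool IsZeroUpperBits(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint32_t upperBits = static_cast<std::uint32_t>(bits >> 32);
    // Shift out the sign bit, then check that everything else in the upper half is 0.
    return static_cast<std::uint32_t>(upperBits << 1) == 0;
}
```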
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse off letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.
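Putting the two intrinsic one-liners together with the suggestion to limit them to MSVC, a sketch might look like the following. The ISZERO_USE_INTRINSICS macro and the IsZero overloads are hypothetical names, and the preprocessor test is only one plausible way to detect MSVC targeting x86.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <emmintrin.h>  // SSE2 intrinsics, used only for MSVC on x86
#define ISZERO_USE_INTRINSICS 1
#else
#define ISZERO_USE_INTRINSICS 0
#endif

// Single-precision: true for +0.0f and -0.0f.
inline bool IsZero(float value) {
#if ISZERO_USE_INTRINSICS
    return _mm_ucomieq_ss(_mm_set_ss(value), _mm_setzero_ps()) != 0;
#else
    return value == 0.0f;  // other compilers are left to pick the instruction sequence
#endif
}

// Double-precision: true for +0.0 and -0.0, with denormals handled correctly.
inline bool IsZero(double value) {
#if ISZERO_USE_INTRINSICS
    return _mm_ucomieq_sd(_mm_set_sd(value), _mm_setzero_pd()) != 0;
#else
    return value == 0.0;
#endif
}
```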

In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the cmath library you can use:
int fpclassify( double arg ), which returns FP_ZERO for both -0.0 and +0.0
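As a small sketch of both options (the function names are hypothetical):

```cpp
#include <cmath>

// Plain comparison: +0.0 and -0.0 both compare equal to 0.0.
bool IsZeroByCompare(double x) {
    return x == 0.0;
}

// Classification: fpclassify reports FP_ZERO for either sign of zero.
bool IsZeroByClassify(double x) {
    return std::fpclassify(x) == FP_ZERO;
}
```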

If you look at the assembly generated for your code, you can see which instructions are used for the different versions. With the assembly in hand, you can estimate which version is better.
With GCC, you can keep the intermediate files (including the assembly output) like this:
gcc -save-temps main.cpp

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM with VC++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument and the return value of an assembly function go, so that the compiler can place the data there in the fastest way? Can we also specify which registers will be used, so that the compiler knows which data is still preserved and can make the best use of it?
For example, since there is no intrinsic function that maps exactly to IDIV r/m64 (the 64-bit signed integer division instruction), we may need to implement it ourselves. IDIV requires the low part of the dividend/numerator to be in RAX, the high part in RDX, and the divisor/denominator in any register or in memory. Afterwards, the quotient is in RAX and the remainder in RDX. We might therefore want to write functions like this (I added some useless extras just to illustrate the idea):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C ++ and ASM that should already exist and enable assembly code to be embedded inline and at high level, practically as simple C++ code.

As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. A conforming C or C++ implementation cannot use a single DIV/IDIV instruction for this expression: the instruction raises a divide-error exception whenever the quotient does not fit in 64 bits, whereas the language requires the full 128-bit quotient to be computed first and only then narrowed to 64 bits. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
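A slightly fuller sketch of the 128-bit approach, assuming a GNU-style compiler that provides __int128, is shown below. The Int128 typedef and the DivideWide name are hypothetical, and the caller is assumed to guarantee that the quotient fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

#if !defined(__SIZEOF_INT128__)
#error "This sketch assumes a compiler with the GNU __int128 extension"
#endif

__extension__ typedef __int128 Int128;  // __extension__ silences pedantic warnings

// Divides a 128-bit dividend by a 64-bit divisor and returns the quotient.
// The quotient is narrowed to 64 bits, so the caller must ensure that it fits.
inline std::int64_t DivideWide(Int128 dividend, std::int64_t divisor,
                               std::int64_t* remainder) {
    assert(divisor != 0);
    *remainder = static_cast<std::int64_t>(dividend % divisor);
    return static_cast<std::int64_t>(dividend / divisor);
}
```

For example, dividing 2^64 + 10 by 2^32 this way gives a quotient of 2^32 and a remainder of 10.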
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
mov rax, rcx
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern "C" int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
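To give an idea of what such a check involves, here is a minimal sketch for the unsigned case only (the DIV variant mentioned above). The function name is hypothetical. For unsigned operands, the 64-bit quotient overflows exactly when the high half of the dividend is not smaller than the divisor; the signed case needs more care and is not covered here.

```cpp
#include <cstdint>

// Returns true if an unsigned 128-by-64 DIV with these operands would raise a
// divide error, either because the divisor is zero or because the quotient
// would not fit in 64 bits.
inline bool UnsignedDiv128x64WouldFault(std::uint64_t hiDividend,
                                        std::uint64_t divisor) {
    // hiDividend >= divisor already covers divisor == 0; the explicit test is
    // kept only for readability.
    return divisor == 0 || hiDividend >= divisor;
}
```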

No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Adding a mov to the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter too much.

Why is there no floating point intrinsic for `PSHUFD` instruction?

The task I'm facing is to shuffle one __m128 vector and store the result in the other one.
The way I see it, there are two basic ways to shuffle a packed floating point __m128 vector:
_mm_shuffle_ps, which uses SHUFPS instruction that is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move.
_mm_shuffle_epi32, which uses PSHUFD instruction that seems to do exactly what is expected here and can have better latency/throughput than SHUFPS.
The latter intrinsic, however, works with integer vectors (__m128i) and there seems to be no floating point counterpart, so using it with __m128 would require some ugly explicit casting. Also, the fact that there is no such counterpart probably means that there is some proper reason for that, which I am not aware of.
The question is why is there no intrinsic to shuffle one floating point vector and store the result in another?
If _mm_shuffle_ps(x,x, ...) can generate PSHUFD, can it be guaranteed?
If PSHUFD should not be used for floating point values, what is the reason for that?
Thank you!

Intrinsics are supposed to map one-to-one with instructions. It would be very undesirable for _mm_shuffle_ps to generate PSHUFD. It should always generate SHUFPS. The documentation does not suggest that there is a case where it would do otherwise.
There is a performance penalty on certain processors when data is cast to single- or double-precision floating-point. This is because the processor augments the SSE registers with internal registers containing the FP classification of the data, e.g. zero or NaN or infinity or normal. When switching types you incur a stall as it performs that step. I don't know if this is still true of modern processors, but you can consult the Intel Architecture Optimization manuals for that information.
SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), they have identical latency and throughput on Haswell (4th gen. Core i7). On Nehalem (1st gen. Core i7), they have identical latency, but PSHUFD has a throughput of 2/cycle and SHUFPS has a throughput of 1/cycle. So, you cannot say that one instruction should be preferred over the other across all processors, even if you ignore the performance penalty associated with switching types.
There is also a way to cast between __m128, __m128d, and __m128i: _mm_castXX_YY (https://software.intel.com/en-us/node/695375?language=es) where XX and YY are each one of ps, pd, or si128. For example, _mm_castps_pd(). This is really a bad idea because the processors on which PSHUFD is faster suffer from the performance penalty associated with switching back to FP afterward. In other words, there is no faster way to do a SHUFPS other than doing a SHUFPS.
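For reference, the two spellings discussed in this thread can be sketched as follows. Both reverse the four floats of a vector; the function names are hypothetical, and which one is faster on a given processor is exactly the open question above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE header)

// SHUFPS with the same vector as both sources.
void ReverseWithShufps(const float in[4], float out[4]) {
    const __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

// PSHUFD on the same bits, reached through the cast intrinsics.
void ReverseWithPshufd(const float in[4], float out[4]) {
    const __m128i v = _mm_castps_si128(_mm_loadu_ps(in));
    _mm_storeu_ps(out,
                  _mm_castsi128_ps(_mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3))));
}
```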

Is there a more direct method to convert float to int with rounding than adding 0.5f and converting with truncation?

Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables.
Consider this snippet of code:
// Convert a positive float value and round to the nearest integer
int RoundedIntValue = (int) (FloatValue + 0.5f);
The C/C++ language defines the (int) cast as truncating, so the 0.5f must be added to ensure rounding up to the nearest positive integer (when the input is positive). For the above, VS2015's compiler generates the following code:
movss xmm9, DWORD PTR __real@3f000000 // 0.5f
addss xmm0, xmm9
cvttss2si eax, xmm0
The above works, but could be more efficient...
Intel's designers apparently thought it was important enough a problem to solve with a single instruction that will do just what's needed: Convert to the nearest integer value: cvtss2si (note, just one 't' in the mnemonic).
If cvtss2si were to replace the cvttss2si instruction in the above sequence, two of the three instructions would simply be eliminated (as would the use of an extra xmm register, which could result in better optimization overall).
So how can we code C++ statement(s) to get this simple job done with the one cvtss2si instruction?
I've been poking around, trying things like the following but even with the optimizer on task it doesn't boil down to the one machine instruction that could/should do the job:
int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue));
Unfortunately the above seems bent on clearing out a whole vector of registers that will never be used, instead of just using the one 32 bit value.
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
Perhaps I'm missing an obvious approach here.
Can you offer a suggested set of C++ instructions that will ultimately generate the single cvtss2si instruction?

This is an optimization defect in Microsoft's compiler, and the bug has been reported to Microsoft. As other commentators have mentioned, modern versions of GCC, Clang, and ICC all produce the expected code. For a function like:
int RoundToNearestEven(float value)
{
return _mm_cvt_ss2si(_mm_set_ss(value));
}
all compilers but Microsoft's will emit the following object code:
cvtss2si eax, xmm0
ret
whereas Microsoft's compiler (as of VS 2015 Update 3) emits the following:
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
ret
The same is seen for the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).
Until such time as the optimizer is improved, there is no faster alternative available. Fortunately, the code currently being generated is not as slow as it might seem. Moving and register-clearing are among the fastest possible instructions, and several of these can probably be implemented solely in the front end as register renames. And it is certainly faster than any of the possible alternatives—often by orders of magnitude:
The trick of adding 0.5 that you mentioned will not only be slower because it has to load a constant and perform an addition, it will also not produce the correctly rounded result in all cases.
Using the _mm_load_ss intrinsic to load the floating-point value into an __m128 structure suitable to be used with the _mm_cvt_ss2si intrinsic is a pessimization because it causes a spill to memory, rather than just a register-to-register move.
(Note that while _mm_set_ss is always better for x86-64, where the calling convention uses SSE registers to pass floating-point values, I have occasionally observed that _mm_load_ss will produce more optimal code in x86-32 builds than _mm_set_ss, but it is highly dependent upon multiple factors and has only been observed when multiple intrinsics are used in a complicated sequence of code. Your default choice should be _mm_set_ss.)
Substituting a reinterpret_cast<__m128&>(value) (or moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It results in a spill from the SSE register to memory; the cvtss2si instruction then uses that memory location as its source operand.
Declaring a temporary __m128 structure and value-initializing it is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and then each slot is filled with either 0 or the floating-point value. This structure's memory location is then used as the source operand for cvtss2si.
The lrint family of functions provided by the C standard library should do what you want, and in fact compile to straightforward cvt* instructions on some other compilers, but are extremely sub-optimal on Microsoft's compiler. They are never inlined, so you always pay the cost of a function call. Plus, the code inside of the function is sub-optimal. Both of these have been reported as bugs, but we are still awaiting a fix. There are similar problems with other conversion functions provided by the standard library, including lround and friends.
The x87 FPU offers a FIST/FISTP instruction that performs a similar task, but the C and C++ language standards require that a cast truncate, rather than round-to-nearest-even (the default FPU rounding mode), so the compiler is obligated to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there's no way to instruct the compiler not to do it except by using inline assembly. Beyond the fact that inline assembly is not available with the 64-bit compiler, MSVC's inline assembly syntax also offers no way to specify inputs and outputs, so you pay double load and store penalties both ways. And even if this weren't the case, you'd still have to pay the cost of copying the floating-point value from an SSE register, into memory, and then onto the x87 FPU stack.
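To illustrate the first point in the list above, here is a small sketch comparing the add-0.5 trick with the intrinsic. The function names are hypothetical, and the comments assume the default rounding mode (round to nearest, ties to even) has not been changed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// The add-0.5 trick: the cast truncates toward zero, so ties always go up for
// positive inputs and negative inputs are not rounded to nearest at all.
inline int RoundByAddingHalf(float value) {
    return static_cast<int>(value + 0.5f);
}

// cvtss2si: rounds according to the current MXCSR rounding mode.
inline int RoundWithCvtss2si(float value) {
    return _mm_cvt_ss2si(_mm_set_ss(value));
}
```

The two differ on ties and on negative inputs: 2.5f becomes 3 with the first function and 2 with the second, and -1.6f becomes -1 with the first and -2 with the second.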
Intrinsics are great, and can often allow you to produce code that is faster than what would otherwise be generated by the compiler, but they are not perfect. If you're like me and find yourself frequently analyzing the disassembly for your binaries, you will find yourself frequently disappointed. Nevertheless, your best choice here is to use the intrinsic.
As for why the optimizer emits the code in the way that it does, I can only speculate since I don't work on the Microsoft compiler team, but my guess would be because a number of the other cvt* instructions have false dependencies that the code-generator needs to work around. For example, a cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we don't actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction has completed, latency is vastly increased. However, by executing an xorps or movss instruction first, the register's upper bits are cleared, thus breaking dependencies and avoiding the possibility for a stall. This is an example of an interesting case where shorter code does not equate to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably applied the heuristic overzealously.

Visual Studio 15.6, released today, appears to finally correct this issue. We now see a single instruction used when inlining this function:
inline int ConvertFloatToRoundedInt(float FloatValue)
{
return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
}
I'm impressed that Microsoft finally got a round tuit.

Getting "actual" registers from MCInsts (x86)

I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.
When I started this, I expected that I would be able to relatively easily identify registers and values used by MCInsts and whip out another representation myself that I could easily work with. However, after some investigation, I realized that the correlation between the operands shown with the textual representation of an instruction and the operands that are actually present within the MCInst object is fairly low. Here are a few examples (Intel syntax):
Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.
Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is in the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.
Moving a 32-bit eip-relative value to eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax, it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.
Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.
Is there a straightforward way to tell which operands appear in the textual representation of an instruction?

ADD having three operands sounds like a compiler builder's preference for three-address code bleeding through, since there is no justification for that in Intel assembler (you can't add and store to a different register with the ADD instruction, though you can with LEA).
The opcodes run into the hundreds if you count all extensions (like SSE, FPU etc), and worse there are multiple variants of an opcode due to addressing modes and prefixes.
The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.

The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands are sources and destinations, whether they are immediates, registers, etc., along with a set of flags.
You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.
I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation, MC should have everything you need.
