_ftol2_sse, are there faster options?

I have code which calls a lot of
int myNumber = (int)(floatNumber);
which takes up, in total, around 10% of my CPU time (according to profiler). While I could leave it at that, I wonder if there are faster options, so I tried searching around, and stumbled upon
http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/
http://stereopsis.com/FPU.html
I tried implementing the Real2Int() function given there, but it gives me wrong results, and runs slower. Now I wonder, are there faster implementations to floor double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found date back a bit, so it might just be outdated, and newer STL is faster at this.
The current implementation does:
013B1030 call _ftol2_sse (13B19A0h)
013B19A0 cmp dword ptr [___sse2_available (13B3378h)],0
013B19A7 je _ftol2 (13B19D6h)
013B19A9 push ebp
013B19AA mov ebp,esp
013B19AC sub esp,8
013B19AF and esp,0FFFFFFF8h
013B19B2 fstp qword ptr [esp]
013B19B5 cvttsd2si eax,mmword ptr [esp]
013B19BA leave
013B19BB ret
Related questions I found:
Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)
What is the fastest way to convert float to int on x86
Since both are old, or are ARM based, I wonder if there are current ways to do this. Note that it says the best conversion is one that doesn't happen, but I need to have it, so that will not be possible.

It's going to be hard to beat that if you are targeting generic x86 hardware. The runtime doesn't know for sure that the target machine has an SSE unit. If it did, it could do what the x64 compiler does and inline a cvttss2si opcode. But since the runtime has to check whether an SSE unit is available, you are left with the current implementation. That's what the implementation of ftol2_sse does. And what's more it passes the value in an x87 register and then transfers it to an SSE register if an SSE unit is available.
You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
That's not going to gain you all that much. It's just going to avoid all the overhead of ftol2_sse that happens before you actually reach the cvttss2si opcode that does the work.
To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.
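If changing /arch for the whole project is not an option, the truncating conversion can also be requested explicitly through an SSE intrinsic. This is only a minimal sketch: it assumes the target CPU is known to support SSE, the helper name truncate_to_int is made up for illustration, and whether it actually beats the plain cast depends on the compiler, so measure it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_cvttss_si32

// Hypothetical helper (not part of the original answer).
// _mm_cvttss_si32 maps to cvttss2si, which truncates toward zero,
// the same way a C++ (int) cast does for in-range values.
inline int truncate_to_int(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

Because the intrinsic names the instruction directly, no call to _ftol2_sse is involved, but the code will fault on a CPU without SSE.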

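For double I don't think you will be able to improve the results much, but if you have a lot of floats to convert, then using a packed conversion could help. The following is NASM code. Note that cvtps2dq rounds according to the current MXCSR rounding mode; cvttps2dq is the truncating form that matches an (int) cast: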
global _start
section .data
align 16
fv1: dd 1.1, 2.5, 2.51, 3.6
section .text
_start:
cvtps2dq xmm1, [fv1] ; Convert four 32-bit(single precision) floats to 32-bit(double word) integers and place the result in xmm1
There should be intrinsics code that allows you to do the same thing in an easier way but I am not as familiar with using intrinsics libraries. Although you are not using gcc this article Auto-vectorization with gcc 4.7 is an eye opener on how hard it can be to get the compiler to generate good vectorized code.
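For reference, the packed conversion might look roughly like this with SSE2 intrinsics. The helper names truncate4 and round4 are hypothetical; _mm_cvttps_epi32 corresponds to cvttps2dq (truncate, like an (int) cast) and _mm_cvtps_epi32 to cvtps2dq (round using the current MXCSR mode, which is round-to-nearest-even by default).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: converts four floats to four ints, truncating toward zero.
inline void truncate4(const float* in, int* out) {
    __m128 v = _mm_loadu_ps(in);        // unaligned load of four floats
    __m128i t = _mm_cvttps_epi32(v);    // cvttps2dq: truncating conversion
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), t);
}

// Hypothetical helper: same, but rounds according to the current MXCSR mode.
inline void round4(const float* in, int* out) {
    __m128i r = _mm_cvtps_epi32(_mm_loadu_ps(in));  // cvtps2dq
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```

With the values from the NASM snippet (1.1, 2.5, 2.51, 3.6) and the default rounding mode, the truncating version gives 1, 2, 2, 3 while the rounding version gives 1, 2, 3, 4.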

If you need speed and a large base of target machines, you'd better introduce a fast SSE version of all your algorithms, as well as a generic one -- and choose the algorithms to be executed at much higher level.
This would also mean that the ABI can be optimized for SSE, that the calculation can be vectorized when SSE is available, and that the control logic can be optimized for the architecture as well.
btw. even FLD; FIST sequence should take no longer than ~7 clock cycles on Pentium.

Related

Does any of current C++ compilers ever emit "rep movsb/w/d"?

This question made me wonder, if current modern compilers ever emit REP MOVSB/W/D instruction.
Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.
But no matter what I tried, I could not make any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) emit this instruction.
For this simple code, it could be reasonable to emit REP MOVSB:
void fn(char *dst, const char *src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?

GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.
I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).
Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
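The overlap behaviour is easy to see with a small experiment. This sketch reuses the question's fn unchanged; the two demo helpers and the test string are made up for illustration.

```cpp
#include <cstring>
#include <string>

// The byte-copy loop from the question, unchanged (no __restrict).
void fn(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++) {
        dst[i] = src[i];
    }
}

// Hypothetical demo: copy 6 bytes from offset 0 to offset 2 of "abcdefgh".
std::string overlap_with_loop() {
    char buf[] = "abcdefgh";
    fn(buf + 2, buf, 6);            // forward copy re-reads bytes it just wrote
    return buf;
}

std::string overlap_with_memmove() {
    char buf[] = "abcdefgh";
    std::memmove(buf + 2, buf, 6);  // behaves as if copied through a temporary
    return buf;
}
```

The loop produces "abababab" (a pattern with period dst-src = 2), while memmove produces "ababcdef". Any transformation of the loop has to preserve the first result unless the compiler can prove the buffers don't overlap.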
gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)
void fn(char *__restrict dst, const char *__restrict src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.
But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:
fn:
test edx, edx
jle .L1 # rep movs treats the counter as unsigned, but the source uses signed
sub edx, 1 # what the heck, gcc? mov ecx,edx would be too easy?
lea ecx, [rdx+1]
rep movsb # dst=rdi and src=rsi
.L1: # matching the calling convention
ret
Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.
rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment, I didn't look carefully.
I also didn't check other compilers.
Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.
(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)
memmove / memcpy strategies:
BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.
Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
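To make the idea concrete, here is a rough C++ sketch of the overlapping two-load / two-store trick for the 4..7 byte case. It only illustrates the strategy and is not glibc's code; the name copy_4_to_7 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper illustrating the strategy only.
// Precondition: 4 <= n <= 7.
inline void copy_4_to_7(void* dst, const void* src, std::size_t n) {
    std::uint32_t head, tail;
    // Both loads complete before either store, so the result is correct
    // even when src and dst overlap.
    std::memcpy(&head, src, 4);                                    // first 4 bytes
    std::memcpy(&tail, static_cast<const char*>(src) + n - 4, 4);  // last 4 bytes
    std::memcpy(dst, &head, 4);
    std::memcpy(static_cast<char*>(dst) + n - 4, &tail, 4);        // may overlap the first store
}
```

Every size from 4 to 7 takes the same path with no size-dependent branch; the two 4-byte stores simply overlap by 8 - n bytes.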
For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.
rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.
It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.
Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.
The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).
Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.
Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)
Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.
IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)
I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.
A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference!)

How to set MMX registers in a Windows exception handler to emulate unsupported 3DNow! instructions

I'm trying to revive an old Win32 game that uses 3DNow! instruction set to make 3D rendering.
On modern OSs like Win7 - Win10, instructions like PFADD or PFMUL are not allowed and the program throws an exception.
Since the number of 3DNow! instructions used by the game is very limited, in my VS2008 MFC program I tried to use vectored exception handling to get the values of the MMX registers, emulate the 3DNow! instructions in C code, and push the values back to the processor's 3DNow! registers.
So far I succeeded in first two steps (I get mmx register values from ExceptionInfo->ExtendedRegisters byte array at offset 32 and use float type C instructions to make calculations), but my problem is that, no matter how I try to update the MMX register values the register values seem to stay unchanged.
Assuming that my _asm statements might be wrong, I did also some minimal test using simple statements like this:
_asm movq mm0, mm7
This statement is executed without further exceptions, but when retrieving the MMX register values I still find that the original values were unchanged.
How can I make the assignment effective?

On modern OSs like Win7 - Win10, instructions like PFADD or PFMUL are not allowed
More likely, your CPU doesn't support 3DNow!. AMD dropped it for Bulldozer-family, and Intel never supported it. So unless you're running modern Windows on an Athlon64 / Phenom (or a Via C3), your CPU doesn't support it.
(Fun fact: PREFETCHW was originally a 3DNow! instruction, and is still supported (with its own CPUID feature bit). For a long time Intel CPUs ran it as a NOP, but Broadwell and later (IIRC) do actually prefetch a cache line into Exclusive state with a Read-For-Ownership.)
Unless this game only ever ran on AMD hardware, it must have a code path that avoids 3DNow. Fix its CPU detection to stop detecting your CPU as having 3DNow. (Maybe you have a recent AMD, and it assumes any AMD has 3DNow?)
(update on that: OP's comments say that the other code paths don't work for some reason. That's a problem.)
Returning from an exception handler probably restores registers from saved state, so it's not surprising that changing register values in the exception handler has no effect on the main program.
Apparently updating ExtendedRegisters in memory doesn't do the trick, though, so that's only a copy of the saved state.
The answer to modifying MMX registers from an exception handler is probably the same as for integer or XMM registers, so look up MS's documentation for that.
Alternative suggestion:
Rewrite the 3DNow code to use SSE2. (You said there's only a tiny amount of it?). SSE2 is baseline for x86-64, and generally safe to assume for 32-bit x86.
Without source, you could still modify the asm for the few functions that use 3DNow. You can literally just change the instructions to use 64-bit loads/stores into XMM registers instead of 3DNow! 64-bit loads/stores, and replace PFMUL with mulps, etc. (This could get slightly hairy if you run out of registers and the 3DNow code used a memory source operand. addps xmm0, [mem] requires 16B-aligned memory, and does a 16 byte load. So you may have to add a spill/reload to borrow another register as a temporary).
If you don't have room to rewrite the functions in-place, put in a jmp to somewhere you do have room to add new code.
Most of the 3DNow instructions have equivalents in SSE, but you may need some extra movaps instructions to copy registers around to implement PFCMPGE. If you can ignore the possibility of NaN, you can use cmpps with a not-less-than predicate. (Without AVX, SSE only has compare predicates based on less-than or not-less-than).
PFSUBR is easy to emulate with a spare register, just copy and subps to reverse. (Or SUBPS and invert the sign with XORPS). PFRCPIT1 and PFRSQIT1 (the first refinement iterations for reciprocal and reciprocal-sqrt) and so on don't have a single-instruction implementation, but you can probably just use sqrtps and divps if you don't want to implement Newton-Raphson iterations with mulps and addps (or with AVX vfmadd). Modern CPUs are much faster than what this game was designed for.
You can load / store a pair of single-precision floats from/to memory into the bottom 64 bits of an XMM register using movsd (the SSE2 double-precision load/store instruction). You can also store a pair with movlps, but still use movsd for loading because it zeros the upper half instead of merging, so it doesn't have a dependency on the old value of the register.
Use movdq2q mm0, xmm0 and movq2dq xmm0, mm0 to move data between XMM and MMX.
Use movaps xmm1, xmm0 to copy registers, even if your data is only in the low half. (movsd xmm1, xmm0 merges the low half into the original high half. movq xmm1, xmm0 zeros the high half.)
addps and mulps work fine with zeros in the upper half. (They can slow down if any garbage (in the upper half) produces a denormal result, so prefer keeping the upper half zeroed). See http://felixcloutier.com/x86/ for an instruction-set reference (and other links in the x86 tag wiki).
Any shuffling of FP data can be done in XMM registers with shufps or pshufd instead of copying back to MMX registers to use whatever MMX shuffles.
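If the emulation ends up being written in C++ rather than patched asm, the same ideas can be expressed with SSE2 intrinsics. The following is only a sketch under that assumption; all helper names are hypothetical, and it uses a 64-bit movq-style load/store (which also zeros the upper half on load) in place of movsd / movlps.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helpers; each works on a pair of packed floats,
// like a 3DNow! instruction operating on one MMX register.

// Load two floats into the low 64 bits of an XMM register, zeroing the upper half.
inline __m128 load_pair(const float* p) {
    return _mm_castsi128_ps(_mm_loadl_epi64(reinterpret_cast<const __m128i*>(p)));
}

// Store the low two floats of an XMM register.
inline void store_pair(float* p, __m128 v) {
    _mm_storel_epi64(reinterpret_cast<__m128i*>(p), _mm_castps_si128(v));
}

// PFMUL dst, src   ->  dst = dst * src
inline void emulate_pfmul(float* dst, const float* src) {
    store_pair(dst, _mm_mul_ps(load_pair(dst), load_pair(src)));
}

// PFSUBR dst, src  ->  dst = src - dst (reversed subtract)
inline void emulate_pfsubr(float* dst, const float* src) {
    store_pair(dst, _mm_sub_ps(load_pair(src), load_pair(dst)));
}
```

Keeping the upper half zeroed means mulps / subps never see garbage there, which avoids the denormal slowdown mentioned above.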

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM with VC++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument and the return value of an assembly function go, so that the compiler can leave the data there in the fastest way? Can we also specify which registers will be used, so that the compiler knows what data is still preserved and can make the best use of it?
For example, since there is no intrinsic function that maps exactly to IDIV r/m64 (64-bit signed integer division in assembly language), we may need to implement it. IDIV requires the low part of the dividend/numerator to be in RAX, the high part in RDX, and the divisor/denominator in any register or in memory. At the end, the quotient is in RAX and the remainder in RDX. We may therefore want to write functions like this (I included some unused extras to illustrate the idea):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C ++ and ASM that should already exist and enable assembly code to be embedded inline and at high level, practically as simple C++ code.

As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
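For comparison, this is roughly what the question's wish looks like with GNU extended asm on GCC or Clang. It is a sketch only: the function name and signature are hypothetical, it does not work with MSVC, and the caller must guarantee that the divisor is non-zero and the quotient fits in 64 bits (otherwise the x86-64 path raises a divide error).

```cpp
#include <cstdint>

// Hypothetical helper: signed 128-bit / 64-bit division, 64-bit quotient.
// Precondition: den != 0 and the quotient fits in int64_t.
inline std::int64_t div128by64(std::int64_t hi, std::uint64_t lo,
                               std::int64_t den, std::int64_t* rem) {
#if defined(__x86_64__) && defined(__GNUC__)
    std::int64_t quo, r;
    __asm__("idivq %[den]"
            : "=a"(quo), "=d"(r)                  // outputs: RAX = quotient, RDX = remainder
            : "0"(lo), "1"(hi), [den] "rm"(den)   // inputs: RAX = low, RDX = high, divisor anywhere
            : "cc");
    *rem = r;
    return quo;
#else
    // Fallback for other 64-bit GCC/Clang targets: let the compiler do it.
    __int128 num = static_cast<__int128>(
        (static_cast<unsigned __int128>(hi) << 64) | lo);
    *rem = static_cast<std::int64_t>(num % den);
    return static_cast<std::int64_t>(num / den);
#endif
}
```

The constraints tell the register allocator where IDIV needs its operands and where the results appear, so the compiler can arrange for values to already be in RAX / RDX instead of shuffling them through memory.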
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the 128-by-64-bit form of the DIV/IDIV instructions for an expression like this, because the result would be non-conforming. These instructions raise a divide-error exception when the quotient does not fit in 64 bits, whereas the language expects the full 128-bit quotient to be computed first and then narrowed by the conversion to int64_t. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
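As an illustration of the multiply-by-reciprocal trick, here is the well-known transformation for unsigned 32-bit division by 10, written out by hand. Compilers pick the constant and shift themselves for each divisor; this sketch just shows the shape of the result, and the function name is made up.

```cpp
#include <cstdint>

// Divide a 32-bit unsigned value by 10 without a DIV instruction.
// 0xCCCCCCCD == ceil(2^35 / 10); multiplying by it and shifting right by 35
// gives the exact quotient for every 32-bit input.
inline std::uint32_t div10(std::uint32_t x) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(x) * 0xCCCCCCCDu) >> 35);
}
```

This only works when the divisor is a compile-time constant, which is why a runtime 128-by-64-bit division still needs a real divide.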
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
.code
Div128x64 PROC
mov rax, rcx ; low bits of dividend go in RAX; high bits are already in RDX
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RAX
Div128x64 ENDP
END
Then you just prototype that in your C++ code as:
// extern "C" prevents C++ name mangling, so the linker finds the MASM symbol
extern "C" int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
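For the unsigned DIV variant, the overflow pre-check is at least cheap to state. A minimal sketch, with a hypothetical function name; the signed IDIV case needs a more involved range check and is not covered here.

```cpp
#include <cstdint>

// Hypothetical guard for an unsigned 128-by-64-bit DIV (RDX:RAX / divisor).
// The quotient fits in 64 bits exactly when the high half of the dividend
// is smaller than the divisor. A zero divisor is rejected too, because no
// unsigned value is less than 0.
inline bool udiv128_is_safe(std::uint64_t hiDividend, std::uint64_t divisor) {
    return hiDividend < divisor;
}
```

Calling the assembly routine only when this returns true avoids the hardware exception, at the cost of one compare and branch per division.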

No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs in the correct registers). You should not worry about that; current CPUs handle mov very well. Adding a mov to the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter much.

Is there a more direct method to convert float to int with rounding than adding 0.5f and converting with truncation?

Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables.
Consider this snippet of code:
// Convert a positive float value and round to the nearest integer
int RoundedIntValue = (int) (FloatValue + 0.5f);
The C/C++ language defines the (int) cast as truncating, so the 0.5f must be added to ensure rounding up to the nearest positive integer (when the input is positive). For the above, VS2015's compiler generates the following code:
movss xmm9, DWORD PTR __real@3f000000 // 0.5f
addss xmm0, xmm9
cvttss2si eax, xmm0
The above works, but could be more efficient...
Intel's designers apparently thought it was important enough a problem to solve with a single instruction that will do just what's needed: Convert to the nearest integer value: cvtss2si (note, just one 't' in the mnemonic).
If the cvtss2si were to replace the cvttss2si instruction in the above sequence two of the three instructions would just be eliminated (as would the use of an extra xmm register, which could result in better optimization overall).
So how can we code C++ statement(s) to get this simple job done with the one cvtss2si instruction?
I've been poking around, trying things like the following but even with the optimizer on task it doesn't boil down to the one machine instruction that could/should do the job:
int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue));
Unfortunately the above seems bent on clearing out a whole vector of registers that will never be used, instead of just using the one 32 bit value.
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
Perhaps I'm missing an obvious approach here.
Can you offer a suggested set of C++ instructions that will ultimately generate the single cvtss2si instruction?

This is an optimization defect in Microsoft's compiler, and the bug has been reported to Microsoft. As other commentators have mentioned, modern versions of GCC, Clang, and ICC all produce the expected code. For a function like:
int RoundToNearestEven(float value)
{
return _mm_cvt_ss2si(_mm_set_ss(value));
}
all compilers but Microsoft's will emit the following object code:
cvtss2si eax, xmm0
ret
whereas Microsoft's compiler (as of VS 2015 Update 3) emits the following:
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
ret
The same is seen for the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).
Until such time as the optimizer is improved, there is no faster alternative available. Fortunately, the code currently being generated is not as slow as it might seem. Moving and register-clearing are among the fastest possible instructions, and several of these can probably be implemented solely in the front end as register renames. And it is certainly faster than any of the possible alternatives—often by orders of magnitude:
The trick of adding 0.5 that you mentioned will not only be slower because it has to load a constant and perform an addition, it will also not produce the correctly rounded result in all cases.
Using the _mm_load_ss intrinsic to load the floating-point value into an __m128 structure suitable to be used with the _mm_cvt_ss2si intrinsic is a pessimization because it causes a spill to memory, rather than just a register-to-register move.
(Note that while _mm_set_ss is always better for x86-64, where the calling convention uses SSE registers to pass floating-point values, I have occasionally observed that _mm_load_ss will produce more optimal code in x86-32 builds than _mm_set_ss, but it is highly dependent upon multiple factors and has only been observed when multiple intrinsics are used in a complicated sequence of code. Your default choice should be _mm_set_ss.)
Substituting a reinterpret_cast<__m128&>(value) (or moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It results in a spill from the SSE register to memory; the cvtss2si instruction then uses that memory location as its source operand.
Declaring a temporary __m128 structure and value-initializing it is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and then each slot is filled with either 0 or the floating-point value. This structure's memory location is then used as the source operand for cvtss2si.
The lrint family of functions provided by the C standard library should do what you want, and in fact compile to straightforward cvt* instructions on some other compilers, but are extremely sub-optimal on Microsoft's compiler. They are never inlined, so you always pay the cost of a function call. Plus, the code inside of the function is sub-optimal. Both of these have been reported as bugs, but we are still awaiting a fix. There are similar problems with other conversion functions provided by the standard library, including lround and friends.
The x87 FPU offers a FIST/FISTP instruction that performs a similar task, but the C and C++ language standards require that a cast truncate, rather than round-to-nearest-even (the default FPU rounding mode), so the compiler is obligated to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there's no way to instruct the compiler not to do it except by using inline assembly. Beyond the fact that inline assembly is not available with the 64-bit compiler, MSVC's inline assembly syntax also offers no way to specify inputs and outputs, so you pay double load and store penalties both ways. And even if this weren't the case, you'd still have to pay the cost of copying the floating-point value from an SSE register, into memory, and then onto the x87 FPU stack.
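To make the rounding difference from the first point above concrete, here is a minimal, portable sketch. It uses std::lrint from the standard library as a stand-in for cvtss2si (both honor the current rounding mode, which is round-to-nearest-even by default); the two function names are made up for illustration, and this says nothing about speed on any particular compiler.

```cpp
#include <cmath>

// The "add 0.5 and truncate" trick. The sum is rounded to float first,
// and the cast then truncates toward zero.
int round_by_adding_half(float x)
{
    float shifted = x + 0.5f;
    return static_cast<int>(shifted);
}

// Rounds using the current rounding mode, which is
// round-to-nearest-even unless the program has changed it.
int round_with_lrint(float x)
{
    return static_cast<int>(std::lrint(x));
}

// Cases where the two disagree:
//   x = 2.5f         -> trick gives 3, lrint gives 2 (tie goes to even)
//   x = -0.7f        -> trick gives 0, lrint gives -1
//   x = 0.49999997f  -> trick gives 1 (the float sum rounds up to 1.0f),
//                       lrint gives 0
```

The last case assumes float arithmetic is actually carried out in single precision (as with SSE code generation); with x87 extended-precision evaluation the intermediate sum may not round up.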
Intrinsics are great, and can often allow you to produce code that is faster than what would otherwise be generated by the compiler, but they are not perfect. If you're like me and find yourself frequently analyzing the disassembly for your binaries, you will find yourself frequently disappointed. Nevertheless, your best choice here is to use the intrinsic.
As for why the optimizer emits the code in the way that it does, I can only speculate since I don't work on the Microsoft compiler team, but my guess would be because a number of the other cvt* instructions have false dependencies that the code-generator needs to work around. For example, a cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we don't actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction that writes the destination register has completed, latency is vastly increased. However, by executing an xorps or movss instruction first, the register's upper bits are cleared, thus breaking dependencies and avoiding the possibility of a stall. This is an example of an interesting case where shorter code does not equate to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably applied the heuristic overzealously.

Visual Studio 2017 version 15.6, released today, appears to finally correct this issue. We now see a single instruction used when inlining this function:
inline int ConvertFloatToRoundedInt(float FloatValue)
{
    return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
}
I'm impressed that Microsoft finally got a round tuit.

Why is strcmp not SIMD optimized?

I've tried to compile this program on an x64 computer:
#include <cstring>
int main(int argc, char* argv[])
{
    return ::std::strcmp(argv[0],
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really long string"
    );
}
I compiled it like this:
g++ -std=c++11 -msse2 -O3 -g a.cpp -o a
But the resulting disassembly is like this:
0x0000000000400480 <+0>: mov (%rsi),%rsi
0x0000000000400483 <+3>: mov $0x400628,%edi
0x0000000000400488 <+8>: mov $0x22d,%ecx
0x000000000040048d <+13>: repz cmpsb %es:(%rdi),%ds:(%rsi)
0x000000000040048f <+15>: seta %al
0x0000000000400492 <+18>: setb %dl
0x0000000000400495 <+21>: sub %edx,%eax
0x0000000000400497 <+23>: movsbl %al,%eax
0x000000000040049a <+26>: retq
Why is no SIMD used? I suppose it could be to compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is it a nonsensical idea for some reason?

In an SSE2 implementation, how would the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte.
If you scan for the length of the string, you have already accomplished most of the work of a strcmp function. Therefore there is no benefit to using SSE2.
However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating zero byte problem. For a nice write-up on them read this blog-post:
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2

GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. This has two versions: a generic version for CPUs without SSE4.2 and a version for CPUs with SSE4.2. Here is the main loop for the SSE4.2 version:
compareloop:
add rax, 16 ; increment offset
movdqu xmm1, [rs1+rax] ; read 16 bytes of string 1
pcmpistri xmm1, [rs2+rax], 00011000B ; unsigned bytes, equal each, invert. returns index in ecx
jnbe compareloop ; jump if not carry flag and not zero flag
For the generic version he writes:
This is a very simple solution. There is not much gained by using SSE2 or anything complicated
Here is the main loop of the generic version:
_compareloop:
mov al, [ss1]
cmp al, [ss2]
jne _notequal
test al, al
jz _equal
inc ss1
inc ss2
jmp _compareloop
I would compare the performance of GCC's builtin strcmp, glibc's strcmp, and the asmlib strcmp. You should look at the disassembly to make sure that you get the builtin code. For example, GCC's memcpy does not use the builtin version for sizes larger than 8192.
Edit:
Regarding the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory page boundaries. To get around this he adds 16 bytes to the .bss section after the .data section. For more details see section 1.7, "String instructions and safety precautions", in the asmlib manual.
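As a rough illustration of the over-read concern, here is a sketch of an SSE2 comparison loop written with intrinsics. It is not Agner's code and not what glibc does: the function names and the 4 KiB page size are assumptions made for the example. It only issues a 16-byte load when that load cannot cross into the next page, and otherwise falls back to a byte loop. Note that reading past the terminator is still outside what the C++ standard permits, so tools such as AddressSanitizer may flag it.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: true if a 16-byte load at p stays inside one
// page, ASSUMING 4 KiB pages.
static bool load16_stays_in_page(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) <= 4096u - 16u;
}

// Sketch of strcmp: returns -1, 0 or 1.
int strcmp_sse2_sketch(const char* a, const char* b)
{
    for (;;) {
#ifdef SKETCH_HAVE_SSE2
        if (load16_stays_in_page(a) && load16_stays_in_page(b)) {
            const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
            const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
            // PCMPEQB: 0xFF in every byte lane where the inputs match.
            const __m128i eq  = _mm_cmpeq_epi8(va, vb);
            const __m128i nul = _mm_cmpeq_epi8(va, _mm_setzero_si128());
            // PMOVMSKB packs the lane results into 16 bits.
            const unsigned stop =
                (~static_cast<unsigned>(_mm_movemask_epi8(eq)) |
                 static_cast<unsigned>(_mm_movemask_epi8(nul))) & 0xFFFFu;
            if (stop == 0) {  // 16 equal, non-zero bytes: skip ahead
                a += 16;
                b += 16;
                continue;
            }
        }
#endif
        // Byte loop: handles the block containing the mismatch or the
        // terminator, and blocks too close to a page boundary.
        for (int i = 0; i < 16; ++i, ++a, ++b) {
            const unsigned char ca = static_cast<unsigned char>(*a);
            const unsigned char cb = static_cast<unsigned char>(*b);
            if (ca != cb) return ca < cb ? -1 : 1;
            if (ca == 0) return 0;
        }
    }
}
```

Whether this beats a well-tuned library strcmp depends on the strings involved; for inputs that differ within the first few bytes, the setup work is unlikely to pay off.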

When the standard library for C was designed, the implementations of string.h methods that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance. But in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.
Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:
int strcmp(const char *p1, const char *p2)
{
    int t1, t2;
    do
    {
        // Compare as unsigned char, as the standard strcmp does.
        t1 = (unsigned char)*p1; t2 = (unsigned char)*p2;
        if (t1 != t2)
        {
            if (t1 > t2) return 1;
            return -1;
        }
        if (!t1)
            return 0;
        p1++; p2++;
    } while(1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();
While the method would be a little big to be in-lined, in-lining could in the first case allow a compiler to eliminate the code to check whether the returned value was greater than zero, and in the second eliminate the code which checked whether t1 was greater than t2. Such optimization would not be possible if the method were dispatched via indirect jump.

Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to go with an auto-vectorization approach. My approach is based on templates and approximates a SIMD register as an array of words of different sizes.
I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.
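To give an idea of what "auto-vectorization friendly" code can look like, here is a minimal sketch of a fixed-size block test. It is not the code from the links below; the function name and the 16-byte block size are assumptions for illustration. The loop has a fixed trip count and no early exit, which is the shape vectorizers tend to handle, but whether a given compiler actually vectorizes it depends on its version and flags.

```cpp
// Hypothetical helper: true if the 16 bytes at a and b are identical
// and none of them is the terminating zero byte. Both pointers must
// refer to at least 16 readable bytes.
bool block16_equal_no_nul(const char* a, const char* b)
{
    unsigned char diff = 0;     // becomes non-zero if any byte pair differs
    unsigned char has_nul = 0;  // becomes 1 if any byte of a is zero
    for (int i = 0; i < 16; ++i) {
        const unsigned char ca = static_cast<unsigned char>(a[i]);
        const unsigned char cb = static_cast<unsigned char>(b[i]);
        diff |= static_cast<unsigned char>(ca ^ cb);
        has_nul |= static_cast<unsigned char>(ca == 0);
    }
    return diff == 0 && has_nul == 0;
}
```

A strcmp built on this would advance 16 bytes while the helper returns true, then finish the final block byte by byte.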
So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (doesn't vectorize my packing function)
3. All compilers declined to generate the PMOVMSKB instruction, which is really sad
Results:
Version compiled by online-GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with Bulldozer-architecture CPU auto-vectorized code is faster than online compiler's and results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM & X86.
Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -march=native -ftree-vectorize -msse2 -march=native -Wall -Wextra
Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)
@PeterCordes, thanks for the help.

I suspect there's simply no point in SIMD versions of library functions with very little computation. I imagine that functions like strcmp, memcpy and similar are actually limited by the memory bandwidth and not the CPU speed.

It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time); these implementations use SIMD and for big amounts (megabytes) use some rather fancy tricks to optimise cache usage. Nothing for strcpy and strcmp as far as I know.
Convincing the C++ standard library to use that kind of code is difficult.

AVX2 would actually be faster.
Edit: It is related to registers and IPC.
Instead of relying on one big instruction, you can use a plethora of SIMD instructions across 16 registers of 32 bytes each; in UTF-16 that gives you 256 characters (512 bytes) to play with!
Double that with AVX-512 in a few years!
AVX instructions also have high throughput.
According to this blog: https://blog.cloudflare.com/improving-picohttpparser-further-with-avx2/
Today on the latest Haswell processors, we have the potent AVX2
instructions. The AVX2 instructions operate on 32 bytes, and most of
the boolean/logic instructions perform at a throughput of 0.5 cycles
per instruction. This means that we can execute roughly 22 AVX2
instructions in the same amount of time it takes to execute a single
PCMPESTRI. Why not give it a shot?
Edit 2.0
SSE/AVX units are power gated, and mixing SSE and/or AVX instructions with regular ones involves a context switch with performance penalty, that you should not have with the strcmp instruction.

I don't see the point in "optimizing" a function like strcmp.
You will need to find the length of the strings before applying any kind of parallel processing, which will force you to read the memory at least once. While you're at it, you might as well use the data to perform the comparison on the fly.
If you want to do anything fast with strings, you will need specialized tools like finite state machines (lexx comes to mind for a parser).
As for C++ std::string, they are inefficient and slow for a large number of reasons, so the gain of checking length in comparisons is negligible.
