Why is strcmp not SIMD optimized?

I've tried to compile this program on an x64 computer:
#include <cstring>
int main(int argc, char* argv[])
{
return ::std::strcmp(argv[0],
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really long string"
);
}
I compiled it like this:
g++ -std=c++11 -msse2 -O3 -g a.cpp -o a
But the resulting disassembly is like this:
0x0000000000400480 <+0>: mov (%rsi),%rsi
0x0000000000400483 <+3>: mov $0x400628,%edi
0x0000000000400488 <+8>: mov $0x22d,%ecx
0x000000000040048d <+13>: repz cmpsb %es:(%rdi),%ds:(%rsi)
0x000000000040048f <+15>: seta %al
0x0000000000400492 <+18>: setb %dl
0x0000000000400495 <+21>: sub %edx,%eax
0x0000000000400497 <+23>: movsbl %al,%eax
0x000000000040049a <+26>: retq
Why is no SIMD used? I suppose it could be to compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is it a nonsensical idea for some reason?

In an SSE2 implementation, how would the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte.
If you scan for the length of the string, you have already accomplished most of the work of a strcmp function. Therefore there is no benefit in using SSE2.
However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating zero byte problem. For a nice write-up on them read this blog-post:
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2

GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. This has two versions: a generic version for CPUs without SSE4.2 and a version for CPUs with SSE4.2. Here is the main loop for the SSE4.2 version:
compareloop:
add rax, 16 ; increment offset
movdqu xmm1, [rs1+rax] ; read 16 bytes of string 1
pcmpistri xmm1, [rs2+rax], 00011000B ; unsigned bytes, equal each, invert. returns index in ecx
jnbe compareloop ; jump if not carry flag and not zero flag
For the generic version he writes
This is a very simple solution. There is not much gained by using SSE2 or anything complicated
Here is the main loop of the generic version:
_compareloop:
mov al, [ss1]
cmp al, [ss2]
jne _notequal
test al, al
jz _equal
inc ss1
inc ss2
jmp _compareloop
I would compare the performance of GCC's builtin strcmp, glibc's strcmp and the asmlib strcmp. You should look at the disassembly to make sure that you get the builtin code. For example, GCC's memcpy does not use the builtin version for sizes larger than 8192.
Edit:
Regarding the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory page boundaries. To get around this he adds 16 bytes to the .bss section after the .data section. For more details see section 1.7, "String instructions and safety precautions", in the asmlib manual.
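An alternative to padding the data section (this is not what asmlib does) is to check before every 16-byte load that it cannot cross into the next page, and to fall back to a byte loop near a page end. Below is a minimal SSE2 sketch of that idea. The function names are hypothetical, the 4 KiB page size is an assumption, and reading past the terminator is still outside what the C++ standard permits, so tools such as AddressSanitizer may complain.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// True if a 16-byte load starting at p stays inside one 4 KiB page.
// The 4096-byte page size is an assumption (typical on x86-64).
inline bool load16_stays_in_page(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 4095u) <= 4096u - 16u;
}

// Hypothetical sketch: returns <0, 0 or >0 like strcmp.
int strcmp_sse2_sketch(const char* a, const char* b) {
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        if (load16_stays_in_page(a) && load16_stays_in_page(b)) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
            // Bit i of eq is set where a[i] == b[i]; bit i of nul where a[i] == 0.
            unsigned eq  = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)));
            unsigned nul = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, zero)));
            // Bit i of stop is set where the strings differ or a ends.
            unsigned stop = (~eq | nul) & 0xFFFFu;
            if (stop == 0) {  // 16 equal, non-zero bytes: skip the whole block
                a += 16;
                b += 16;
                continue;
            }
        }
        // Byte loop: resolves the block found above, or walks carefully
        // when a 16-byte load would have crossed a page boundary.
        for (int i = 0; i < 16; ++i) {
            unsigned char ca = static_cast<unsigned char>(a[i]);
            unsigned char cb = static_cast<unsigned char>(b[i]);
            if (ca != cb) return ca < cb ? -1 : 1;
            if (ca == 0) return 0;
        }
        a += 16;
        b += 16;
    }
}
```

The byte loop never reads past a terminator or a mismatch, so the only over-read is the vector load, and that load is confined to a page that already holds a valid byte of the string.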

When the standard library for C was designed, the implementations of string.h methods that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance. But in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.
Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:
int strcmp(const char *p1, const char *p2)
{
    // Compare as unsigned char, as the C standard specifies for strcmp.
    unsigned char t1, t2;
    do
    {
        t1 = *p1; t2 = *p2;
        if (t1 != t2)
        {
            if (t1 > t2) return 1;
            return -1;
        }
        if (!t1)
            return 0;
        p1++; p2++;
    } while (1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();
While the method would be a little big to be in-lined, in-lining could in the first case allow a compiler to eliminate the code to check whether the returned value was greater than zero, and in the second eliminate the code which checked whether t1 was greater than t2. Such optimization would not be possible if the method were dispatched via indirect jump.

Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to choose auto-vectorization approach. My approach is based on templates and approximates SIMD register as an array of words of different sizes.
I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.
So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (doesn't vectorize my packing function)
3. All compilers declined to generate PMOVMSKB instruction, it is really sad
Results:
Version compiled by online-GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with Bulldozer-architecture CPU auto-vectorized code is faster than online compiler's and results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM & X86.
Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -march=native -ftree-vectorize -msse2 -march=native -Wall -Wextra
Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)
@PeterCordes, thanks for the help.

I suspect there's simply no point in SIMD versions of library functions with very little computation. I imagine that functions like strcmp, memcpy and similar are actually limited by the memory bandwidth and not the CPU speed.

It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time); these implementations use SIMD and for big amounts (megabytes) use some rather fancy tricks to optimise cache usage. Nothing for strcpy and strcmp as far as I know.
Convincing the C++ standard library to use that kind of code is difficult.

AVX 2.0 would be faster actually
Edit: It is related to registers and IPC
Instead of relying on 1 big instruction, you can use a plethora of SIMD instructions with 16 registers of 32 bytes; in UTF-16 that gives you 256 chars to play with!
double that with avx512 in few years!
AVX instructions also do have high throughput.
According to this blog: https://blog.cloudflare.com/improving-picohttpparser-further-with-avx2/
Today on the latest Haswell processors, we have the potent AVX2
instructions. The AVX2 instructions operate on 32 bytes, and most of
the boolean/logic instructions perform at a throughput of 0.5 cycles
per instruction. This means that we can execute roughly 22 AVX2
instructions in the same amount of time it takes to execute a single
PCMPESTRI. Why not give it a shot?
Edit 2.0
SSE/AVX units are power gated, and mixing SSE and/or AVX instructions with regular ones involves a context switch with performance penalty, that you should not have with the strcmp instruction.

I don't see the point in "optimizing" a function like strcmp.
You will need to find the length of the strings before applying any kind of parallel processing, which will force you to read the memory at least once. While you're at it, you might as well use the data to perform the comparison on the fly.
If you want to do anything fast with strings, you will need specialized tools like finite state machines (lex comes to mind for a parser).
As for C++ std::string, they are inefficient and slow for a large number of reasons, so the gain of checking length in comparisons is negligible.

Related

Why C-style Arrays performance in O3 is less than no optimization on Quick Bench?

Based on C-style Arrays vs std::vector using std::vector::at, std::vector::operator[], and iterators,
I ran the following benchmarks.
no optimization
https://quick-bench.com/q/LjybujMGImpATTjbWePzcb6xyck
O3
https://quick-bench.com/q/u5hnSy90ZRgJ-CQ75b1c1a_3BuY
From here, vectors definitely perform better in O3.
However, C-style Array is slower with -O3 than -O0
C-style (no opt) : about 2500
C-style (O3) : about 3000
I don't know what factors lead to this result. Maybe it's because the compiler is c++14?
(I'm not asking about std::vector relative to plain arrays, I'm just asking about plain arrays with/without optimization.)

Your -O0 code wasn't faster in an absolute sense, just as a ratio against an empty
for (auto _ : state) {} loop.
That also gets slower when optimization is disabled, because the state iterator functions don't inline. Check the asm for your own functions, and instead of an outer-loop counter in %rbx like:
# outer loop of your -O3 version
sub $0x1,%rbx
jne 407f57 <BM_map_c_array(benchmark::State&)+0x37>
RBX was originally loaded from 0x10(%rdi), from the benchmark::State& state function arg.
You instead get state counter updates in memory, like the following, plus a bunch of convoluted code that materializes a boolean in a register and then tests it again.
# part of the outer loop of your -O0 version
12.50% mov -0x8060(%rbp),%rax
25.00% sub $0x1,%rax
12.50% mov %rax,-0x8060(%rbp)
There are high counts on those instructions because the call map_c_array didn't inline, so most of the CPU time wasn't actually spent in this function itself. But of the time that was, about half was on these instructions. In an empty loop, or one that called an empty function (I'm not sure which Quick Bench is doing), that would still be the case.
Quick Bench does this to try to normalize things for whatever hardware its cloud VM ends up running on, with whatever competing load. Click the "About Quick Bench" in the dropdown at the top right.
And see the label on the graph: CPU time / Noop time. (When they say "Noop", they don't mean a nop machine instruction, they mean in a C++ sense.)
An empty loop with a loop counter runs about 6x slower when compiled with optimization disabled (bottlenecked on store-to-load forwarding latency of the loop counter), so your -O0 code is "only" a bit less than 6x slower, not exactly 6x slower.
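To make that arithmetic concrete, here is a small sketch that converts the two scores from the question back to a common unit. It assumes the rough 6x factor above for the unoptimized empty loop and takes the -O3 empty loop as the unit; neither number is a measurement, and the names are made up for illustration.

```cpp
// Quick Bench plots "CPU time / Noop time", so a score is only comparable
// across builds after multiplying by that build's own noop cost.
constexpr double absolute_cost(double score, double noop_cost) {
    return score * noop_cost;
}

// Assumptions: scores are the approximate values quoted in the question;
// the 6x factor is the rough estimate for an empty loop at -O0.
constexpr double o0_cost  = absolute_cost(2500.0, 6.0);  // noop ~6x slower at -O0
constexpr double o3_cost  = absolute_cost(3000.0, 1.0);  // -O3 noop taken as the unit
constexpr double slowdown = o0_cost / o3_cost;           // 5.0
```

Under those assumptions the -O0 build is about 5x slower in absolute terms, even though its normalized score looks better.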
With a counter in a register, modern x86 CPUs can run loops at 1 cycle per iteration, like looptop: dec %ebx / jnz looptop. dec has one cycle latency, vs. subtract or dec on a memory location being about 6 cycles since it includes the store/reload. (See https://agner.org/optimize/ and https://uops.info/.) Also see:
The performance of two scan functions (benchmarked without optimization; my answer explains that they bottleneck on store-forwarding latency.)
Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
Adding a redundant assignment speeds up code when compiled without optimization (Intel Sandybridge-family store-forwarding has variable latency depending on how soon you try to reload).
With that bottleneck built-in to the baseline you're comparing against, it's normal that adding some array-access work inside a loop won't be as much slower as array access vs. an empty loop.

Because you aren't benchmarking what you think you're benchmarking. I bothered to look at your code, and found that you're trying to see how fast your CPU can advance the counter in a for loop while seeing how fast your data BUS can transfer data. Is this really something you need to worry about, like ever?
In general, benchmarks outside multi-thousand programs are worthless and will never be taken with a straight face by anyone even remotely experienced in programming, so stop doing that.

Does any of current C++ compilers ever emit "rep movsb/w/d"?

This question made me wonder, if current modern compilers ever emit REP MOVSB/W/D instruction.
Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.
But no matter how I tried, I could not make any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) emit this instruction.
For this simple code, it could be reasonable to emit REP MOVSB:
void fn(char *dst, const char *src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?

GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.
I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).
Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
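A small sketch of that overlap behaviour, using the loop from the question without __restrict. The two helper functions are hypothetical and exist only to show the difference from memmove on an 8-character buffer.

```cpp
#include <cstring>
#include <string>

// The byte-copy loop from the question (no __restrict, so overlap is allowed).
void fn(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++) {
        dst[i] = src[i];
    }
}

// Hypothetical helper: copy 6 bytes forward by 2 within one buffer using the loop.
std::string overlap_with_loop() {
    char buf[] = "abcdefgh";
    fn(buf + 2, buf, 6);   // dst - src == 2, so "ab" repeats
    return buf;            // "abababab"
}

// Hypothetical helper: the same copy with memmove keeps the original source bytes.
std::string overlap_with_memmove() {
    char buf[] = "abcdefgh";
    std::memmove(buf + 2, buf, 6);
    return buf;            // "ababcdef"
}
```

Any replacement for the loop has to reproduce the first result, which is why the compiler cannot treat the loop as memcpy or memmove unless it knows the buffers don't overlap.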
gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)
void fn(char *__restrict dst, const char *__restrict src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.
But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:
fn:
test edx, edx
jle .L1 # rep movs treats the counter as unsigned, but the source uses signed
sub edx, 1 # what the heck, gcc? mov ecx,edx would be too easy?
lea ecx, [rdx+1]
rep movsb # dst=rdi and src=rsi
.L1: # matching the calling convention
ret
Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.
rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment, I didn't look carefully.
I also didn't check other compilers.
Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.
(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)
memmove / memcpy strategies:
BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.
Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
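Here is a rough C++ sketch of that idea for the 4..8 byte size class. It is not glibc's code (which is hand-written asm), and the function name is made up: both loads happen before either store, so the copy is safe even if source and destination overlap, and every size in the range takes the same path.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy n bytes, 4 <= n <= 8, with two 4-byte moves that may overlap each other.
// Hypothetical sketch of the "two loads, then two stores" strategy.
inline void copy_4_to_8(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    std::uint32_t head, tail;
    std::memcpy(&head, s, 4);          // first 4 bytes
    std::memcpy(&tail, s + n - 4, 4);  // last 4 bytes (overlaps head when n < 8)
    std::memcpy(d, &head, 4);          // stores only after both loads are done
    std::memcpy(d + n - 4, &tail, 4);
}
```

The fixed-size memcpy calls are just the portable way to write unaligned 4-byte loads and stores; compilers normally turn each one into a single mov.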
For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.
rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.
It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.
Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.
The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).
Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.
Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)
Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.
IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)
I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.
A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference)!

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM in VC ++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument and the return value of an assembly function go, so that the compiler can place the data there in the fastest way? Can we also specify which registers will be used, so that the compiler knows which data is still preserved and can make the best use of it?
For example, since there is no intrinsic function that maps exactly to IDIV r/m64 (64-bit signed integer division in assembly language), we may need to implement it. IDIV requires the low part of the dividend/numerator to be in RAX, the high part in RDX, and the divisor/denominator in any register or in memory. At the end, the quotient is in RAX and the remainder in RDX. We might therefore want to write functions like this (I added some unnecessary items just to illustrate):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C ++ and ASM that should already exist and enable assembly code to be embedded inline and at high level, practically as simple C++ code.

As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
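As an illustration of the reciprocal trick, this is the kind of code a compiler generates for an unsigned 32-bit division by the constant 10. The function name is made up; the magic constant 0xCCCCCCCD is 2^35 / 10 rounded up, and the multiply-and-shift gives the exact quotient for every 32-bit input.

```cpp
#include <cstdint>

// Divide by 10 without a division instruction: multiply by ceil(2^35 / 10)
// in 64-bit arithmetic, then shift the product right by 35 bits.
constexpr std::uint32_t div10(std::uint32_t x) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(x) * 0xCCCCCCCDull) >> 35);
}
```

This only works when the divisor is known at compile time, which is why a runtime 128-bit by 64-bit division still needs IDIV/DIV or a library routine.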
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
mov rax, rcx
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern "C" int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
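For the unsigned variant using DIV, the overflow check is at least simple: the quotient fits in 64 bits exactly when the high half of the dividend is smaller than the divisor. A minimal sketch with a made-up helper name (the signed IDIV case needs a more involved check and is not covered here):

```cpp
#include <cstdint>

// True if the unsigned 128-by-64-bit DIV of hi:lo by divisor cannot fault:
// the divisor is non-zero and the quotient fits in 64 bits.
// The low half of the dividend does not affect the answer.
constexpr bool div128x64_is_safe(std::uint64_t hi, std::uint64_t divisor) {
    return divisor != 0 && hi < divisor;
}
```

A caller would test this before jumping into the assembly routine and take a slower full 128-bit path (or report an error) when it returns false.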

No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting mov into a code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter too much.

_ftol2_sse, are there faster options?

I have code which calls a lot of
int myNumber = (int)(floatNumber);
which takes up, in total, around 10% of my CPU time (according to profiler). While I could leave it at that, I wonder if there are faster options, so I tried searching around, and stumbled upon
http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/
http://stereopsis.com/FPU.html
I tried implementing the Real2Int() function given there, but it gives me wrong results, and runs slower. Now I wonder, are there faster implementations to floor double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found date back a bit, so it might just be outdated, and newer STL is faster at this.
The current implementation does:
013B1030 call _ftol2_sse (13B19A0h)
013B19A0 cmp dword ptr [___sse2_available (13B3378h)],0
013B19A7 je _ftol2 (13B19D6h)
013B19A9 push ebp
013B19AA mov ebp,esp
013B19AC sub esp,8
013B19AF and esp,0FFFFFFF8h
013B19B2 fstp qword ptr [esp]
013B19B5 cvttsd2si eax,mmword ptr [esp]
013B19BA leave
013B19BB ret
Related questions I found:
Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)
What is the fastest way to convert float to int on x86
Since both are old, or are ARM based, I wonder if there are current ways to do this. Note that it says the best conversion is one that doesn't happen, but I need to have it, so that will not be possible.

It's going to be hard to beat that if you are targeting generic x86 hardware. The runtime doesn't know for sure that the target machine has an SSE unit. If it did, the compiler could do what the x64 compiler does and inline a cvttss2si opcode. But since the runtime has to check whether an SSE unit is available, you are left with the current implementation, which is what _ftol2_sse does. What's more, it receives the value in an x87 register and then transfers it through memory to an SSE register if an SSE unit is available.
You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
That's not going to gain you all that much, though. It just avoids the overhead of _ftol2_sse that happens before you actually reach the cvttss2si opcode that does the work.
To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.
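If changing /arch for the whole project is not an option, the same instruction can be requested at individual call sites through an intrinsic. A minimal sketch, assuming the code only ever runs on machines with SSE (the function name trunc_to_int is made up here; _mm_set_ss and _mm_cvttss_si32 are the standard SSE intrinsics from <xmmintrin.h>):

```cpp
#include <xmmintrin.h>  // SSE: _mm_set_ss, _mm_cvttss_si32

// Hypothetical helper: truncating float -> int conversion, like (int)value.
// _mm_cvttss_si32 maps to the cvttss2si instruction.
inline int trunc_to_int(float value)
{
    return _mm_cvttss_si32(_mm_set_ss(value));
}
```

Whether this beats the compiler's own output depends on how the float reaches the function (x87 register or SSE register), so it is worth profiling before adopting it.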

For double I don't think you will be able to improve the results much, but if you have a lot of floats to convert then a packed conversion could help. The following is NASM code:
global _start
section .data
align 16
fv1: dd 1.1, 2.5, 2.51, 3.6
section .text
_start:
cvtps2dq xmm1, [fv1] ; Convert four 32-bit(single precision) floats to 32-bit(double word) integers and place the result in xmm1
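Note that cvtps2dq rounds according to the current MXCSR rounding mode (round-to-nearest by default), so 2.51 and 3.6 above become 3 and 4; cvttps2dq is the truncating form that matches a C cast. There should be intrinsics that allow you to do the same thing more easily, but I am not as familiar with the intrinsics libraries. Although you are not using gcc, the article "Auto-vectorization with gcc 4.7" is an eye-opener on how hard it can be to get the compiler to generate good vectorized code.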
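For reference, a packed conversion with intrinsics could look roughly like the sketch below. The function name trunc4 is made up for this example; the intrinsics themselves are standard SSE/SSE2 ones from <emmintrin.h>. It uses _mm_cvttps_epi32 (cvttps2dq), which truncates toward zero like an (int) cast; _mm_cvtps_epi32 would be the direct counterpart of cvtps2dq and rounds instead.

```cpp
#include <emmintrin.h>  // SSE2: _mm_cvttps_epi32, _mm_storeu_si128

// Hypothetical helper: converts exactly four floats to four ints,
// truncating toward zero. No alignment is required for in or out.
inline void trunc4(const float* in, int* out)
{
    __m128  v = _mm_loadu_ps(in);        // unaligned load of 4 floats
    __m128i r = _mm_cvttps_epi32(v);     // packed truncating conversion
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```

A real loop would call this on blocks of four and handle any leftover elements with a scalar cast.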

If you need speed and a large base of target machines, you'd better introduce a fast SSE version of all your algorithms, as well as a generic one, and choose which algorithms to execute at a much higher level.
This would also mean that the ABI is optimized for SSE, that you can vectorize the calculation when available, and that the control logic is optimized for the architecture as well.
By the way, even an FLD; FIST sequence should take no longer than ~7 clock cycles on a Pentium.

What C++ code compiles down to the x86 REP instruction?

I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?

Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic is both easier to use and faster.
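For the array copy in the question, that amounts to something like the following sketch (copy_ints is just an illustrative name). Whether the compiler turns it into rep movs, an SSE loop or a library call is left to the compiler:

```cpp
#include <cstddef>
#include <cstring>

// Copies count ints from src to dst; the two ranges must not overlap
// (use std::memmove if they might).
inline void copy_ints(int* dst, const int* src, std::size_t count)
{
    std::memcpy(dst, src, count * sizeof(int));
}
```

std::copy from <algorithm> is the type-safe alternative and typically ends up in the same place for trivially copyable element types.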

Under MSVC there are the __movsxxx & __stosxxx intrinsics that will generate a REP prefixed instruction.
There is also a 'hack' to force an intrinsic memset, aka REP STOS, under VC9+, as the intrinsic no longer exists due to the SSE2 branching in the CRT. This is better than __stosxxx because the compiler can optimize it for constants and order it correctly.
#include <windows.h> // DWORD, WORD, BYTE

// Overload taking a DWORD pointer; the hack relies on the compiler
// turning the store loop below into REP STOS.
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
    //credits to Nepharius for finding this
    DWORD* pLast = pStart + (nSize >> 2);
    while(pStart < pLast)
        *pStart++ = dwFill;

    // Handle the 0-3 trailing bytes.
    if((nSize &= 3) == 0)
        return;
    if(nSize == 3)
    {
        (((WORD*)pStart))[0] = WORD(dwFill);
        (((BYTE*)pStart))[2] = BYTE(dwFill);
    }
    else if(nSize == 2)
        (((WORD*)pStart))[0] = WORD(dwFill);
    else
        (((BYTE*)pStart))[0] = BYTE(dwFill);
}

// Defined after the function, so that the definition above is not itself macro-expanded.
// Replicates the fill byte into all four bytes of a DWORD.
#define memset(mem,fill,size) memset((DWORD*)(mem),((fill) << 24|(fill) << 16|(fill) << 8|(fill)),size)
Of course REP isn't always the best thing to use; IMO you're way better off using memcpy, which will branch to either SSE2 or REP MOVS based on your system (under MSVC), unless you feel like writing custom assembly for 'hot' areas...

If you need exactly that instruction - use built-in assembler and write that instruction manually. You can't rely on the compiler to produce any specific machine code - even if it emits it in one compilation it can decide to emit some other equivalent during next compilation.
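As a sketch of what that looks like, assuming GCC or Clang on x86/x86-64 rather than MSVC (MSVC uses __asm { ... } blocks and only supports them in 32-bit builds), with rep_movsb_copy being a made-up name:

```cpp
#include <cstddef>

// GCC/Clang extended asm, x86 or x86-64 only. Assumes the direction flag
// is clear, which the usual ABIs guarantee at function boundaries.
inline void rep_movsb_copy(void* dst, const void* src, std::size_t count)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(count)  // (R/E)DI, (R/E)SI, (R/E)CX are all modified
                 :
                 : "memory");                         // the copy reads and writes memory
}
```

The three registers are declared as read-write operands because the instruction advances both pointers and counts the counter down to zero.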

REP and friends were nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC processor.
But that has changed. Nowadays when the processor encounters any instruction, the first thing it does is translate it into an easier format (VLIW-like micro-ops) and schedule it for future execution (this is part of out-of-order execution and of scheduling between different logical CPU cores, and it can be used to simplify write-after-write sequences into single writes, etc.). This machinery works well for instructions that translate into a few VLIW-like opcodes, but not for machine code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors on building CPU circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy mode that stutteringly stalls the pipeline, and ask modern programmers to write their own damn loops!
Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, it was probably written by a human assembly-muppet who didn't know better, or by a cracker who really needed the few bytes it saved compared to an actual loop.
(However. Take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore, I got into other hobbies..)

I use the rep* prefix variants with cmps*, movs*, scas* and stos* instruction variants to generate inline code which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else which may overall be faster if I want to copy a hundred bytes or more but if it's just a matter of 10-20 bytes using rep is faster (or at least was the last time I measured).
Since my compiler allows specification and use of inline assembly functions and includes their register usage/modification in the optimization activities it is possible for me to use them when the circumstances are right.

On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.
In the fifteen years or so since, rep has become, relatively speaking, faster again, which would suggest more transistors/less microcode.
