gcc inline asm x86 CPU flags as input dependency - c++

I want to create a function for adding two 16-bit integers with overflow detection. I have a generic variant written in portable C, but the generic variant is not optimal for the x86 target, because the CPU internally calculates the overflow flag when executing ADD/SUB/etc. Of course, there is __builtin_add_overflow(), but in my case it generates some boilerplate.
So I wrote the following code:
#include <cstdint>

struct result_t
{
    uint16_t src;
    uint16_t dst;
    uint8_t  of;
};

static void add_u16_with_overflow(result_t& r)
{
    char of, cf;
    asm (
        " addw %[dst], %[src] "
        : [dst] "+mr"(r.dst)//, "=#cco"(of), "=#ccc"(cf)
        : [src] "imr" (r.src)
        : "cc"
    );
    asm (" seto %0 " : "=rm" (r.of) );
}
uint16_t test_add(uint16_t a, uint16_t b)
{
    result_t r;
    r.src = a;
    r.dst = b;
    add_u16_with_overflow(r);
    add_u16_with_overflow(r);
    return (r.dst + r.of); // use r.dst and r.of to prevent discarding
}
I've played with this on https://godbolt.org/g/2mLF55 (gcc 7.2, -O2 -std=c++11), and it results in:
test_add(unsigned short, unsigned short):
        seto    %al
        movzbl  %al, %eax
        addw    %si, %di
        addw    %si, %di
        addl    %esi, %eax
        ret
So, seto %0 is reordered. It seems gcc thinks there is no dependency between the two consecutive asm() statements, and the "cc" clobber has no effect on flag dependencies.
I can't use volatile, because the seto %0 (or the whole function) can, and should, be optimized out if the result (or some part of the result) is not used.
I can add a dependency on r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) );, and the reordering will not happen. But it is not the "right thing", and the compiler can still insert code that changes the flags (but doesn't change r.dst) between the add and the seto statements.
Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", creating a dependency between the statements and preventing reordering?

I haven't looked at gcc's output for __builtin_add_overflow, but how bad is it? @David's suggestion to use it, per https://gcc.gnu.org/wiki/DontUseInlineAsm, is usually good advice, especially if you're worried about how this will optimize; inline asm defeats constant propagation and some other optimizations.
Also, if you are going to use asm, note that AT&T syntax uses add %[src], %[dst] operand order. See the tag wiki for details, unless you're always going to build your code with -masm=intel.
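For comparison, here's a minimal sketch of the builtin (hypothetical wrapper function, not from the question; the operand types select the condition, so int16_t detects signed overflow like OF, while uint16_t would detect unsigned carry-out like CF):

#include <cstdint>

bool add_overflows(int16_t a, int16_t b, int16_t *sum)
{
    return __builtin_add_overflow(a, b, sum);  // true if the result wrapped
}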
Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", creating a dependency between the statements and preventing reordering?
No. Put the flag-consuming instruction (seto) inside the same asm block as the flag-producing instruction. An asm statement can have as many input and output operands as you like, limited only by register-allocation difficulty (multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.
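For example, here's a minimal sketch of the combined statement (the "+rm"/"=qm" constraint choices are mine, not from the question; note the corrected AT&T operand order):

static void add_u16_with_overflow(result_t& r)
{
    asm ("addw %[src], %[dst]\n\t"   // AT&T order: source first, destination last
         "seto %[of]"                // consume OF in the same asm statement
         : [dst] "+rm" (r.dst), [of] "=qm" (r.of)
         : [src] "ri" (r.src)
         : "cc");
}

With GCC 6 and later you can go further and ask for the flag directly via a flag-output operand, which is presumably what the commented-out "=#cco" in the question was reaching for (when using flag outputs, drop the "cc" clobber):

    asm ("addw %[src], %[dst]"
         : [dst] "+rm" (r.dst), "=@cco" (r.of)   // OF lands directly in r.of
         : [src] "ri" (r.src));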
I was going to suggest that if you want multiple flag outputs from one instruction, you could use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. That is often inconvenient, and seems like a bad design choice, because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS; OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. Since that isn't the case, setc + seto are probably better than pushf + reload, though the latter is worth considering.
Even if there were syntax for flag inputs (like there is for flag outputs), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov) between your two separate asm statements.
You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency so it's not a big bottleneck to put a dependent instruction right after it.
And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto doesn't support output operands. You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload is worse than using setc or seto to produce an input for a compiler-generated test/jnz.
If you didn't also need an output, you could put C labels on a return true and a return false statement (see the sketch below), which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if(). For examples, see how Linux uses asm goto (with extra complicating factors in the two examples I found): setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch.
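A hedged sketch of that label idea (the function name and scratch-register choice are mine; the inputs are copied to an explicitly clobbered register, since asm goto forbids output operands):

#include <cstdint>

static bool add_u16_overflows(uint16_t a, uint16_t b)
{
    asm goto ("movw %w0, %%ax\n\t"   // copy to a scratch we're allowed to clobber
              "addw %w1, %%ax\n\t"
              "jo %l[overflow]"      // jump straight to the C label on OF
              : /* asm goto: no outputs allowed */
              : "r" (a), "r" (b)
              : "ax", "cc"
              : overflow);
    return false;
overflow:
    return true;
}

After inlining, if (add_u16_overflows(a, b)) can compile to a jo/jno straight to whichever side of the if() the compiler lays out.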

Related

Using AVX to xor two zmm (512 bit) registers

I would like to bitwise-XOR zmm0 with zmm1.
I read around the internet and tried:
asm volatile(
    "vmovdqa64 (%0),%%zmm0;\n"
    "vmovdqa64 (%1),%%zmm1;\n"
    "vpxorq %%zmm1, %%zmm0;\n"
    "vmovdqa64 %%zmm0,(%0);\n"
    :: "r"(p_dst), "r" (p_src)
    : );
But the compiler gives "Error: number of operands mismatch for `vpxorq'".
What am I doing wrong?
Inline asm for this is pointless (https://gcc.gnu.org/wiki/DontUseInlineAsm), and your code is unsafe and inefficient even if you fixed the syntax error by adding the 3rd operand.
Use the intrinsic _mm512_xor_epi64( __m512i a, __m512i b); as documented in Intel's asm manual entry for pxor. Look at the compiler-generated asm if you want to see how it's done.
Unsafe because you don't have a "memory" clobber to tell the compiler that you read/write memory, and you don't declare clobbers on zmm0 or zmm1.
And inefficient for many reasons, including forcing the addressing modes and not using a memory source operand. And not letting the compiler pick which registers to use.
Just fixing the asm syntax so it compiles will go from having an obvious compile-time bug to a subtle and dangerous runtime bug that might only be visible with optimization enabled.
See https://stackoverflow.com/tags/inline-assembly/info for more about inline asm. But again, there is basically zero reason to use it for most SIMD because you can get the compiler to make asm that's just as efficient as what you can do by hand, and more efficient than this.
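For reference, a minimal sketch of the intrinsics version (my function name; assumes AVX-512F, compiling with -mavx512f, and 64-byte-aligned pointers to match the vmovdqa64 in the question):

#include <immintrin.h>

void xor_zmm(void *p_dst, const void *p_src)
{
    __m512i d = _mm512_load_si512(p_dst);   // aligned 64-byte load
    __m512i s = _mm512_load_si512(p_src);
    _mm512_store_si512(p_dst, _mm512_xor_epi64(d, s));
}

The compiler picks the registers, can fold one load into a memory source operand, and knows exactly which memory is read and written.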
Most AVX-512 instructions use 3+ operands, i.e. you need to add an additional operand: the dst register (which can be the same as one of the source operands).
This is also true for the AVX2 version, see https://www.felixcloutier.com/x86/pxor:
VPXOR ymm1, ymm2, ymm3/m256
VPXORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Note that the above is Intel syntax and would roughly translate into zmm1 = zmm2 ^ zmm3; in your case I guess you wanted to use "vpxorq %%zmm1, %%zmm0, %%zmm0;\n".
Be advised that using inline assembly is generally a bad practice, reserved for really special occasions. SIMD programming is better (faster, easier) done using the intrinsics supported by all major compilers. You can browse them here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Do any current C++ compilers ever emit "rep movsb/w/d"?

This question made me wonder whether current modern compilers ever emit the REP MOVSB/W/D instructions.
Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.
But no matter what I tried, I cannot make any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) emit this instruction.
For this simple code, it could be reasonable to emit REP MOVSB:
void fn(char *dst, const char *src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}
But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?
GCC has x86 tuning options to control the string-ops strategy and when to inline vs. call the library (see https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.
I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).
Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
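A tiny self-contained demo of those semantics (dst = src+1, so the repeating pattern has length 1):

#include <cstdio>

int main()
{
    char buf[8] = "Xabcdef";
    for (int i = 0; i < 6; i++)   // byte-at-a-time copy with dst = src + 1
        buf[1 + i] = buf[i];      // each store feeds the next load
    std::puts(buf);               // prints "XXXXXXX", not a shifted copy
}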
gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)
void fn(char *__restrict dst, const char *__restrict src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}
This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.
But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:
fn:
        test    edx, edx
        jle     .L1          # rep movs treats the counter as unsigned, but the loop uses a signed int
        sub     edx, 1       # what the heck, gcc? mov ecx,edx would be too easy?
        lea     ecx, [rdx+1]
        rep movsb            # dst=rdi and src=rsi, matching the calling convention
.L1:
        ret
Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.
rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment, I didn't look carefully.
I also didn't check other compilers.
Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.
(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)
memmove / memcpy strategies:
BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.
Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
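Not glibc's actual code, but a sketch of the idea for one size class (two 4-byte copies handle every n from 4 to 8 without branching on n; doing both loads before either store is what makes it insensitive to overlap):

#include <cstdint>
#include <cstring>

static void copy_4_to_8(char *dst, const char *src, std::size_t n)
{
    uint32_t head, tail;
    std::memcpy(&head, src, 4);           // first 4 bytes
    std::memcpy(&tail, src + n - 4, 4);   // last 4 bytes, may overlap the first
    std::memcpy(dst, &head, 4);
    std::memcpy(dst + n - 4, &tail, 4);
}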
For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.
rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.
It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.
Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.
The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).
Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.
Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)
Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.
IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)
I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.
A lot of software gets compiled with only -O2, so a rep movs peephole, separate from the auto-vectorizer, could make a difference. (But the question is whether it's a positive or a negative difference!)

Using inline assembly with serialization instructions

We consider that we are using GCC (or a GCC-compatible compiler) on an x86-64 architecture, and that eax, ebx, ecx, edx and level are variables (unsigned int or unsigned int*) for the input and output of the instruction (like here).
asm("CPUID":::);
asm volatile("CPUID":::);
asm volatile("CPUID":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)::"memory");
asm volatile("CPUID":"=a"(eax):"0"(level):"memory");
asm volatile("CPUID"::"a"(level):"memory"); // Not sure of this syntax
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level));
I am not used to the inline assembly syntax, and I am wondering what would be the difference between all these calls, in a context where I just want to use CPUID as a serializing instruction (e.g. nothing will be done with the output of the instruction).
Can some of these calls lead to errors?
Which one(s) of these calls would be the most suited (given that I want as little overhead as possible, but at the same time the "strongest" serialization possible)?
First of all, lfence may be serializing enough for your use-case, e.g. for rdtsc. If you care about performance, check whether you can find evidence that lfence is strong enough (at least for your use-case). Possibly even using the sequence mfence; lfence might be better than cpuid, e.g. if you want to drain the store buffer before an rdtsc.
But neither lfence nor mfence are serializing on the whole pipeline in the official technical-terminology meaning, which could matter for cross-modifying code - discarding instructions that might have been fetched before some stores from another core became visible.
Yes, all the variants that don't tell the compiler that the asm statement writes E[A-D]X are dangerous and will likely cause hard-to-debug weirdness; i.e. you need to use (dummy) output operands or clobbers.
You need volatile, because you want the asm code to be executed for the side-effect of serialization, not to produce the outputs.
If you don't want to use the CPUID result for anything (e.g. do double duty by serializing and querying something), you should simply list the registers as clobbers, not outputs, so you don't need any C variables to hold the results.
// volatile is already implied because there are no output operands
// but it doesn't hurt to be explicit.
// Serialize and block compile-time reordering of loads/stores across this
asm volatile("CPUID"::: "eax","ebx","ecx","edx", "memory");
// the "eax" clobber covers RAX in x86-64 code, you don't need an #ifdef __i386__
I am wondering what would be the difference between all these calls
First of all, none of these are "calls". They're asm statements, and inline into the function where you use them. CPUID itself is not a "call" either, although I guess you could look at it as calling a microcode function built-in to the CPU. But by that logic, every instruction is a "call", e.g. mul rcx takes inputs in RAX and RCX, and returns in RDX:RAX.
The first three (and the later one with no outputs, just a level input) destroy RAX through RDX without telling the compiler. It will assume that those registers still hold whatever it was keeping in them. They're obviously unusable.
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory"); (the one without volatile) will optimize away if you don't use any of the outputs. And if you do use them, it can still be hoisted out of loops. A non-volatile asm statement is treated by the optimizer as a pure function with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#index-asm-volatile
It has a memory clobber, but (I think) that doesn't stop it from optimizing away, it just means that if / when / where it does run, any variables it could possibly read / write are synced to memory, so memory contents match what the C abstract machine would have at that point. This may exclude locals that haven't had their address taken, though.
asm("" ::: "memory") is very similar to std::atomic_thread_fence(std::memory_order_seq_cst), but note that that asm statement has no outputs, and thus is implicitly volatile. That's why it isn't optimized away, not because of the "memory" clobber itself. A (volatile) asm statement with a memory clobber is a compiler barrier against reordering loads or stores across it.
The optimizer doesn't care at all what's inside the first string literal, only the constraints / clobbers, so asm volatile("anything" ::: register clobbers, "memory") is also a compile-time-only memory barrier. I assume this is what you want, to serialize some memory operations.
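For example, a compile-time-only barrier (just GNU C; on x86 the hardware already keeps stores in order, so the asm statement here exists purely to stop compile-time reordering):

int data, ready;

void publish(int v)
{
    data = v;
    asm volatile("" ::: "memory");  // zero instructions, but no load/store may cross it
    ready = 1;                      // guaranteed to be emitted after the store to data
}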
"0"(level) is a matching constraint for the first operand (the "=a"). You could equally have written "a"(level), because in this case the compiler doesn't have a choice of which register to select; the output constraint can only be satisfied by eax. You could also have used "+a"(eax) as the output operand, but then you'd have to set eax=level before the asm statement. Matching constraints instead of read-write operands are sometimes necessary for x87 stack stuff; I think that came up once in an SO question. But other than weird stuff like that, the advantage is being able to use different C variables for input and output, or not using a variable at all for the input. (e.g. a literal constant, or an lvalue (expression)).
Anyway, telling the compiler to provide an input will probably result in an extra instruction, e.g. level=0 would result in an xor-zeroing of eax. This would be a waste of an instruction if it didn't already need a zeroed register for anything earlier. Normally xor-zeroing an input would break a dependency on the previous value, but the whole point of CPUID here is that it's serializing, so it has to wait for all previous instructions to finish executing anyway. Making sure eax is ready early is pointless; if you don't care about the outputs, don't even tell the compiler your asm statement takes an input.
Compilers make it difficult or impossible to use an undefined / uninitialized value with no overhead; sometimes leaving a C variable uninitialized will result in loading garbage from the stack, or zeroing a register, instead of just using a register without writing it first.

Why is strcmp not SIMD optimized?

I've tried to compile this program on an x64 computer:
#include <cstring>
int main(int argc, char* argv[])
{
return ::std::strcmp(argv[0],
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really really really"
"really really really really really really really long string"
);
}
I compiled it like this:
g++ -std=c++11 -msse2 -O3 -g a.cpp -o a
But the resulting disassembly is like this:
0x0000000000400480 <+0>: mov (%rsi),%rsi
0x0000000000400483 <+3>: mov $0x400628,%edi
0x0000000000400488 <+8>: mov $0x22d,%ecx
0x000000000040048d <+13>: repz cmpsb %es:(%rdi),%ds:(%rsi)
0x000000000040048f <+15>: seta %al
0x0000000000400492 <+18>: setb %dl
0x0000000000400495 <+21>: sub %edx,%eax
0x0000000000400497 <+23>: movsbl %al,%eax
0x000000000040049a <+26>: retq
Why is no SIMD used? I suppose it could compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is that a nonsensical idea for some reason?
In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte.
If you scan for the length of the string, you have already accomplished most of the work of a strcmp function, so there is no benefit to using SSE2.
However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating-zero-byte problem. For a nice write-up on them, read this blog post:
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. It has two versions: a generic version for CPUs without SSE4.2, and a version for CPUs with SSE4.2. Here is the main loop of the SSE4.2 version:
compareloop:
        add       rax, 16                      ; increment offset
        movdqu    xmm1, [rs1+rax]              ; read 16 bytes of string 1
        pcmpistri xmm1, [rs2+rax], 00011000B   ; unsigned bytes, equal each, invert. returns index in ecx
        jnbe      compareloop                  ; jump if not carry flag and not zero flag
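Here is a hedged intrinsics sketch of the same loop (my translation, not Agner's code; compile with -msse4.2, and it shares the over-read caveat discussed in the Edit below):

#include <cstddef>
#include <nmmintrin.h>

#define MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)

int strcmp_sse42(const char *s1, const char *s2)
{
    for (std::size_t off = 0; ; off += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + off));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + off));
        if (_mm_cmpistrc(a, b, MODE)) {          // CF: some bytes differ
            int i = _mm_cmpistri(a, b, MODE);    // ECX: index of first difference
            return (unsigned char)s1[off + i] - (unsigned char)s2[off + i];
        }
        if (_mm_cmpistrz(a, b, MODE))            // ZF: terminator seen, no difference
            return 0;
    }
}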
For the generic version he writes:
This is a very simple solution. There is not much gained by using SSE2 or anything complicated.
Here is the main loop of the generic version:
_compareloop:
        mov     al, [ss1]
        cmp     al, [ss2]
        jne     _notequal
        test    al, al
        jz      _equal
        inc     ss1
        inc     ss2
        jmp     _compareloop
I would compare the performance of GCC's builtin strcmp, glibc's strcmp, and the asmlib strcmp. You should look at the disassembly to make sure that you actually get the builtin code. For example, GCC's memcpy does not use the builtin version for sizes larger than 8192.
Edit:
In regards to the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem, since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory-page boundaries. To get around this, he adds 16 bytes to the .bss section after the .data section. For more details see section 1.7, String instructions and safety precautions, in the manual of the asmlib.
When the standard library for C was designed, the implementations of string.h methods that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance. But in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.
Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:
int strcmp(const char *p1, const char *p2)
{
    int t1, t2;
    do
    {
        // compare as unsigned char, matching standard strcmp semantics
        t1 = *(const unsigned char *)p1;
        t2 = *(const unsigned char *)p2;
        if (t1 != t2)
        {
            if (t1 > t2) return 1;
            return -1;
        }
        if (!t1)
            return 0;
        p1++; p2++;
    } while(1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();
While the method would be a little big to inline, inlining could in the first case allow a compiler to eliminate the code checking whether the returned value was greater than zero, and in the second case eliminate the code checking whether t1 was greater than t2. Such optimization would not be possible if the method were dispatched via an indirect jump.
Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to choose an auto-vectorization approach. My approach is based on templates and approximates a SIMD register as an array of words of different sizes.
I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.
So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (doesn't vectorize my packing function)
3. All compilers declined to generate the PMOVMSKB instruction; it is really sad
Results:
The version compiled by online GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with a Bulldozer-architecture CPU, the auto-vectorized code is faster than the online compiler's, and the results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM & x86.
Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -march=native -ftree-vectorize -msse2 -march=native -Wall -Wextra
Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)
@PeterCordes, thanks for the help.
I suspect there's simply no point in SIMD versions of library functions with very little computation. I imagine that functions like strcmp, memcpy and similar are actually limited by memory bandwidth and not CPU speed.
It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time); these implementations use SIMD and for big amounts (megabytes) use some rather fancy tricks to optimise cache usage. Nothing for strcpy and strcmp as far as I know.
Convincing the C++ standard library to use that kind of code is difficult.
AVX 2.0 would be faster actually
Edit: It is related to registers and IPC
Instead of relying on 1 big instruction, you can use a plethora of SIMD instructions with 16 registers of 32 bytes each; in UTF-16 that gives you 256 chars to play with!
Double that with AVX-512 in a few years!
AVX instructions also have high throughput.
According to this blog: https://blog.cloudflare.com/improving-picohttpparser-further-with-avx2/
Today on the latest Haswell processors, we have the potent AVX2 instructions. The AVX2 instructions operate on 32 bytes, and most of the boolean/logic instructions perform at a throughput of 0.5 cycles per instruction. This means that we can execute roughly 22 AVX2 instructions in the same amount of time it takes to execute a single PCMPESTRI. Why not give it a shot?
Edit 2.0
SSE/AVX units are power gated, and mixing SSE and/or AVX instructions with regular ones involves a context switch with a performance penalty, which you should not have with strcmp.
I don't see the point in "optimizing" a function like strcmp.
You will need to find the length of the strings before applying any kind of parallel processing, which will force you to read the memory at least once. While you're at it, you might as well use the data to perform the comparison on the fly.
If you want to do anything fast with strings, you will need specialized tools like finite state machines (lex comes to mind, for a parser).
As for C++ std::string, they are inefficient and slow for a large number of reasons, so the gain from checking length in comparisons is negligible.

what will be the addressing mode in assembly code generated by the compiler here?

Suppose we've got an integer and a character variable:
int adad=12345;
char character;
Assuming we're discussing a platform on which the length of an integer variable is at least three bytes, I want to access the third byte of this integer and put it in the character variable. With that said, I'd write it like this:
character=*((char *)(&adad)+2);
Considering that line of code, and the fact that I'm not a compiler or assembly expert (I know a little about addressing modes in assembly), I'm wondering whether the address of the third byte (or, I guess it's better to say, the offset of the third byte) would be embedded within the instructions generated by that line of code themselves, or whether it'd be in a separate variable whose address (or offset) is within those instructions?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
    int adad=12345;
    volatile char character;
    character=*((char *)(&adad)+2);
    return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
        pushq   %rbp
        movq    %rsp,%rbp
        movl    $0x00003039,0xf8(%rbp)
        movb    0xfa(%rbp),%al
        movb    %al,0xff(%rbp)
        xorl    %eax,%eax
        leave
        ret
The only three lines that we care about are:
        movl    $0x00003039,0xf8(%rbp)
        movb    0xfa(%rbp),%al
        movb    %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
        movl    $0x00003039,0xf8(%rbp)
        leaq    0xf8(%rbp),%rax
        addq    $0x02,%rax
        movzbl  (%rax),%eax
        movb    %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
Without knowing anything about the compiler and the underlying CPU architecture, no definitive answer can be given. For example, not all CPU architectures allow addressing every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed, instead of byte-addressed, what the compiler generates is inevitably going to be a load of the whole word adad into some register (presumably at an offset from a base pointer register, if the variable in question is on the stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers, or from memory addresses defined by other registers or constants, and some such architectures exist (though they may not be all that popular at this time;-).
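To make the word-addressed case concrete, here's the operation such a compiler must synthesize, written out in C (hypothetical function; little-endian byte order assumed):

#include <cstring>

char third_byte(int adad)
{
    unsigned word;
    std::memcpy(&word, &adad, sizeof(word));   // the whole-word load
    return (char)((word >> 16) & 0xFF);        // shift byte 2 down, mask off the rest
}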