Compiler Explorer and GCC have different outputs - C++

I have some C code that when given to Compiler Explorer, it outputs:
mov BYTE PTR [rbp-4], al
mov eax, ecx
mov BYTE PTR [rbp-8], al
mov eax, edx
mov BYTE PTR [rbp-12], al
However if I use GCC or G++ then it gives me this:
mov BYTE PTR 16[rbp], al
mov eax, edx
mov BYTE PTR 24[rbp], al
mov eax, ecx
mov BYTE PTR 32[rbp], al
I have no idea why the BYTE PTRs are different. The addresses look completely wrong, and I don't get why the offsets come before the [rbp] part.
If you know how to reproduce the first output using gcc or g++, please help!

gcc.exe (GCC) 8.2.0
Looks like GCC, targeting the Windows x64 calling convention, is using the shadow space (32 bytes above the return address) reserved by its caller. Godbolt's GCC installs target GNU/Linux, i.e. the x86-64 System V ABI.
You can get the same code on Godbolt by marking your function with __attribute__((ms_abi)). Of course that means your caller has to see that attribute in the prototype so it knows to reserve that space, and which registers to pass function args in.
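As a concrete illustration, here's a minimal sketch (not the asker's code; the function names are made up) showing both attributes side by side:
__attribute__((sysv_abi)) void f_sysv(char a, char b, char c) {}
__attribute__((ms_abi))   void f_ms(char a, char b, char c) {}
Compiled without optimization, f_sysv should spill its arguments to [rbp-4]-style slots in its own frame (the Godbolt output), while f_ms should home them in the caller-reserved shadow space at 16[rbp], 24[rbp], 32[rbp] (the gcc.exe output).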
The Windows x64 calling convention is mostly worse than x86-64 System V; fewer arg-passing registers for example. One of its only advantages is easier implementation of variadic functions (because of the shadow space), and having some call-preserved XMM regs. (Probably too many, but x86-64 SysV has zero.) So more likely you want to use a cross compiler (targeting GNU/Linux) on Windows, or use __attribute__((sysv_abi)) on all your functions. (https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html)
The XMM part of the calling convention is normally irrelevant for kernel code; most kernels avoid saving/restoring the SIMD/FPU state on kernel entry/exit by not letting the compiler use SIMD/FP instructions.

Related

asm inspection of c++ compiled object. what is the meaning of this cs: part [duplicate]

I write simple programs and then analyze them.
Today I wrote this:
#include <stdio.h>
int x;
int main(void){
    printf("Enter X:\n");
    scanf("%d",&x);
    printf("You enter %d...\n",x);
    return 0;
}
It's compiled into this:
push rbp
mov rbp, rsp
lea rdi, s ; "Enter X:"
call _puts
lea rsi, x
lea rdi, aD ; "%d"
mov eax, 0
call ___isoc99_scanf
mov eax, cs:x <- don't understand this
mov esi, eax
lea rdi, format ; "You enter %d...\n"
mov eax, 0
call _printf
mov eax, 0
pop rbp
retn
I don't understand what cs:x means.
I use Ubuntu x64, GCC 10.3.0, and IDA Pro 7.6.
TL:DR: IDA confusingly uses cs: to indicate a RIP-relative addressing mode in 64-bit code.
In IDA mov eax, x means mov eax, DWORD [x] which in turn means reading a DWORD from the variable x.
For completeness, mov rax, OFFSET x means mov rax, x (i.e. putting the address of x in rax).
In 64-bit mode, displacements are still 32-bit, so it's not always possible to address a variable by encoding its absolute address (the address is 64-bit and would not fit into a 32-bit field). And in position-independent code it's not desirable anyway.
Instead, RIP-relative addressing is used.
In NASM, RIP-relative addressing takes the form mov eax, [REL x]; in GAS it is mov x(%rip), %eax.
Also, in NASM, if DEFAULT REL is active, the instruction can be shortened to mov eax, [x] which is identical to the 32-bit syntax.
Each disassembler will disassemble a RIP-relative operand differently. As you commented, Ghidra gives mov eax, DWORD PTR [x].
IDA uses mov eax, cs:x to mean mov eax, [REL x]/mov x(%rip), %eax.
;IDA listing, 64-bit code
mov eax, x ;This is mov eax, [x] in NASM, and most likely wrong unless your executable is non-PIE and always loaded below 4GiB
mov eax, cs:x ;This is mov eax, [REL x] in NASM and idiomatic to 64-bit programs
In short, you can mostly ignore the cs: because that's just the way variables are addressed in 64-bit mode.
Of course, as the listing above shows, the use or absence of RIP-relative addressing tells you whether the program can be loaded anywhere or only below 4GiB.
The cs prefix shown by IDA threw me off.
I can see that it could mentally resemble "code" and thus the RIP register, but RIP-relative addressing does not imply a cs segment override.
In 32-bit mode, the code segment is usually read-only, so an instruction like mov [cs:x], eax will fault.
In this scenario, putting a cs: in front of the operand would be wrong.
In 64-bit mode, segment overrides (other than fs/gs) are ignored (and the read-bit of the code segment is ignored anyway), so the presence of a cs: doesn't really matter because ds and cs are effectively indistinguishable. (Even an ss or ds override doesn't change the #GP or #SS exception for a non-canonical address.)
Probably the AGU doesn't even read the segment shadow registers anymore for segment bases other than fs or gs. (Although even in 32-bit mode, there's a lower latency fast path for the normal case of segment base = 0, so hardware may just let that do its job.)
Still cs: is misleading in my opinion - a 2E prefix byte is still possible in machine code as padding. Most tools still call it a CS prefix, although http://ref.x86asm.net/coder64.html calls it a "null prefix" in 64-bit mode. There's no such byte here, and cs: is not an obvious or clear way to imply RIP-relative addressing.

VS2022 MASM giving error "'ADDR32' relocation to 'lut' invalid without /LARGEADDRESSAWARE:NO" [duplicate]

I'm running this code on my Mac, assembling with the command:
nasm -f macho64 -o max.a maximum.asm
This is the code I am attempting to run; it finds the largest number in an array.
section .data
data_items:
    dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
    mov edi, 0
    mov eax, [data_items + edi*4]
    mov ebx, eax
start_loop:
    cmp eax, 0
    je loop_exit
    inc edi
    mov eax, [data_items + edi*4]
    cmp eax, ebx
    jle start_loop
    mov ebx, eax
    jmp start_loop
loop_exit:
    mov eax, 1
    int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses
First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not1.
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within +-2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tying up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. Relative Addressing errors - Mac 10.10 mentions that Agner Fog's "optimizing assembly" guide has some examples of array addressing, including one using the __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 for the differences in which registers to pass args in.
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). This related question x64 nasm: pushing memory addresses onto the stack & call function includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So while you could use mov rsi, qword data_items, a RIP-relative LEA runs about as fast and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/stores from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

Order of assignment produces different assembly

This experiment was done using GCC 6.3. There are two functions where the only difference is in the order we assign the i32 and i16 in the struct. We assumed that both functions should produce the same assembly. However this is not the case. The "bad" function produces more instructions. Can anyone explain why this happens?
#include <inttypes.h>
union pack {
    struct {
        int32_t i32;
        int16_t i16;
    };
    void *ptr;
};
static_assert(sizeof(pack)==8, "what?");

void *bad(const int32_t i32, const int16_t i16) {
    pack p;
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}

void *good(const int32_t i32, const int16_t i16) {
    pack p;
    p.i16 = i16;
    p.i32 = i32;
    return p.ptr;
}
...
bad(int, short):
movzx eax, si
sal rax, 32
mov rsi, rax
mov eax, edi
or rax, rsi
ret
good(int, short):
movzx eax, si
mov edi, edi
sal rax, 32
or rax, rdi
ret
The compiler flags were -O3 -fno-rtti -std=c++14
This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so1. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad asm; it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted-mov missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer simply never came up until some other optimization pass changed what it does, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well technically there is a read-uninitialized because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).
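For reference, the same merge can be written with plain shifts, which sidesteps both the union and the uninitialized-padding question (a sketch, not the question's code; it assumes the same little-endian layout the union relies on):
#include <cstdint>

void *pack_shift(int32_t i32, int16_t i16) {
    // i32 in the low 32 bits, i16 in bits 32..47, the top 16 bits explicitly zero
    uint64_t bits = static_cast<uint32_t>(i32)
                  | (static_cast<uint64_t>(static_cast<uint16_t>(i16)) << 32);
    return reinterpret_cast<void *>(static_cast<uintptr_t>(bits));
}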
Also note, both versions in the question are less efficient than possible:
(The output for bad from GCC4.8 or GCC11-trunk, which avoids the wasted mov, looks optimal for that choice of strategy.)
mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct was ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel, and on Haswell and later which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)

How to use processor instructions in C++ to implement fast arithmetic operations

I was working on a C++ implementation of Shamir's secret sharing scheme. I split the message into 8-bit chunks and perform the corresponding arithmetic on each. The underlying finite field was Rijndael's finite field F_256 / (x^8 + x^4 + x^3 + x + 1).
I did a quick search for a well-known, widespread library for Rijndael's finite-field calculations (e.g. OpenSSL or similar) and didn't find any. So I implemented it from scratch, partly as a programming exercise.
A few days ago, however, a professor at our university mentioned the following: "Modern processors support carry-less integer operations, so the characteristic-2 finite field multiplications run fast nowadays."
Hence, since I know very little about hardware, assembly, and similar topics, my question is: how do I actually use (in C++) all the modern processors' instructions when building crypto software - whether it is AES, SHA, the arithmetic from above, or whatever else? I can't find any satisfactory resources on that. My idea is to build a library containing both a "modern-approach fast implementation" and a fallback "pure C++ dependency-less" one, and let GNU Autoconf decide which to use on each respective host. Any book/article/tutorial recommendation on this topic would be appreciated.
The question is quite broad because there are several ways you might access the power of the underlying hardware, so instead of one specific way here's a list of ways you can try to use all the modern processors' instructions:
Idiom Recognition
Write out the operation not offered directly in C++ in "long form" and hope your compiler recognizes it as an idiom for the underlying instruction you want. For example, you could write a variable rotate left of x by amount as (x << amount) | (x >> (32 - amount)) and all of gcc, clang and icc will recognize this as a rotate and issue the underlying rol instruction supported by x86.
Sometimes this technique puts you in a bit of an uncomfortable spot: the above C++ rotate implementation exhibits undefined behavior for amount == 0 (and also amount >= 32) since the result of a shift by 32 on a uint32_t is undefined, but the code actually produced by these compilers is just fine in that case. Still, having this lurking undefined behavior in your program is dangerous, and it probably won't come through clean under ubsan and friends. The alternative safe version amount ? (x << amount) | (x >> (32 - amount)) : x; is only recognized by icc, not by gcc or clang.
This approach tends to work for common idioms that map directly to assembly-level instructions that have been around for a while: rotates, bit tests and sets, multiplications with a wider result than inputs (e.g., multiplying two 32-bit values for a 64-bit result), conditional moves and so on, but is less likely to pick up bleeding edge instructions that might also be of interest to cryptography. For example, I'm quite sure no compiler will currently recognize an application of the AES instruction set extensions. It also works best on platforms that have received a lot of effort on the part of the compiler developers since each recognized idiom has to be added by hand.
I don't think this technique will work with your carry-less multiplication (PCLMULQDQ), but maybe one day (if you file an issue against the compilers)? It does work for other "crypt-interesting" functions though, including rotate.
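For example, the widely used "safe rotate" form masks the count so every shift amount stays in range; gcc and clang have recognized this pattern as a single rol for years, though it's worth checking the generated asm for your particular compiler version (a sketch, assuming 32-bit operands):
#include <cstdint>

uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;                              // keep the count in range
    return (x << n) | (x >> (-n & 31));   // well defined even for n == 0
}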
Intrinsic Functions
As an extension, compilers will often offer intrinsic functions which are not part of the language proper but often map directly to an instruction offered by most hardware. Although it looks like a function call, the compiler generally just emits the single instruction needed at the place you call it.
GCC calls these built-in functions and you can find a list of generic ones here. For example, you can use the __builtin_popcount call to emit the popcnt instruction, if the current target supports it. Many of the gcc builtins are also supported by icc and clang, and in this case all of gcc, clang and icc support this call and emit popcnt as long as the architecture (-march=haswell) is set to Haswell. Otherwise, clang and icc inline a replacement version using some clever SWAR tricks, while gcc calls __popcountdi2, which is provided by the runtime1.
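For instance, a minimal sketch of the builtin (the 64-bit variant is __builtin_popcountll):
#include <cstdint>

uint64_t count_bits(uint64_t x) {
    // With -mpopcnt (implied by -march=haswell and later) this should compile
    // to a single popcnt; otherwise you get the fallback behaviour described above.
    return __builtin_popcountll(x);
}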
The intrinsics above are generic and generally offered on any platform the compilers support. You can also find platform-specific intrinsics, for example this list from gcc.
For x86 SIMD instructions specifically, Intel makes available a set of intrinsic functions declared in headers covering their ISA extensions, e.g., available by including #include <x86intrin.h>. These have wider support than the gcc intrinsics, e.g., they are supported by Microsoft's Visual Studio compiler suite. New instruction sets are usually added before chips that support them become available, so you can use these to access new instructions immediately on release.
Programming with SIMD intrinsic functions is kind of a halfway house between C++ and full assembly. The compiler still takes care of things like calling conventions and register allocation, and some optimizations are made (especially for generating constants and other broadcasts) - but generally what you write is more or less what you get at the assembly level.
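As it relates to the question, here's a hedged sketch of the carry-less multiply the professor was alluding to, via the PCLMULQDQ intrinsic (requires -mpclmul or an -march that includes it; reducing the product modulo your field polynomial is not shown):
#include <wmmintrin.h>   // _mm_clmulepi64_si128
#include <emmintrin.h>
#include <cstdint>

__m128i clmul_u64(uint64_t a, uint64_t b) {
    __m128i va = _mm_set_epi64x(0, static_cast<long long>(a));
    __m128i vb = _mm_set_epi64x(0, static_cast<long long>(b));
    // Multiply the low 64-bit halves as GF(2) polynomials; the full
    // 128-bit product comes back in one vector register.
    return _mm_clmulepi64_si128(va, vb, 0x00);
}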
Inline Assembly
If your compiler offers it, you can use inline assembly to call whatever instructions you want2. This has a lot of similarities to using intrinsic functions, but with a somewhat higher level of difficulty and fewer opportunities for the optimizer to help you out. You should probably prefer intrinsic functions unless you have a specific reason for inline assembly. One example could be if the optimizer does a really bad job with intrinsics: you could use an inline assembly block to get exactly the code you want.
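As a small illustration of GNU extended inline asm (gcc/clang, AT&T syntax), here's the same rotate as earlier, spelled as an explicit instruction; this is a sketch, and with the idiom or an intrinsic you normally wouldn't need it:
#include <cstdint>

uint32_t rotl32_asm(uint32_t x, uint32_t n) {
    uint32_t result;
    // "0" ties the input to the output register; "c" pins the count into CL,
    // which the variable-count rol instruction requires.
    asm("roll %%cl, %0"
        : "=r"(result)
        : "0"(x), "c"(n)
        : "cc");
    return result;
}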
Out-of-line Assembly
You can also just write your entire kernel function in assembly, assemble it how you want, and then declare it extern "C" and call it from C++. This is similar to the inline assembly option, but works on compilers that don't support inline assembly (e.g., 64-bit Visual Studio). You can also use a different assembler if you want, which is especially convenient if you are targeting multiple C++ compilers since you can then use a single assembler for all of them.
You need to take care of the calling conventions yourself, and other messy things like DWARF unwind info and Windows SEH handling.
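The C++ side of that arrangement is just an ordinary declaration; for example (gf256_mul_asm is a hypothetical name standing in for a routine you'd write in a separate .asm/.S file, not a real library function):
#include <cstdint>

extern "C" uint8_t gf256_mul_asm(uint8_t a, uint8_t b);  // defined in assembly

uint8_t demo(uint8_t a, uint8_t b) {
    // The caller just sees a normal function; the calling convention and
    // unwind info are whatever your assembly file implements.
    return gf256_mul_asm(a, b);
}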
For very short functions, this approach doesn't work well since the call overhead will likely be prohibitive3.
Auto-Vectorization4
If you want to write fast cryptography today for a CPU, you are pretty much going to be targeting mostly SIMD instructions. Most new algorithms designed for software implementation are also designed with vectorization in mind.
You can use intrinsic functions or assembly to write SIMD code, but you can also write normal scalar code and rely on the auto-vectorizer. These got a bad name back in the early days of SIMD, and while they are still far from perfect they have come a long way.
Consider this simple function which takes payload and key byte arrays and xors the key into the payload:
void otp_scramble(uint8_t* payload, uint8_t* key, size_t n) {
    for (size_t i = 0; i < n; i++) {
        payload[i] ^= key[i];
    }
}
This is a softball example, of course, but anyway gcc, clang and icc all vectorize this to something like this inner loop5:
movdqu xmm0, XMMWORD PTR [rdi+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
pxor xmm0, xmm1
movups XMMWORD PTR [rdi+rax], xmm0
It's using SSE instructions to load and xor 16 bytes at a time. The developer only has to reason about the simple scalar code, however!
One advantage of this approach versus intrinsics or assembly is that you aren't baking the SIMD width of the instruction set into the source. The same C++ code as above compiled with -march=haswell results in a loop like:
vmovdqu ymm1, YMMWORD PTR [rdi+rax]
vpxor ymm0, ymm1, YMMWORD PTR [rsi+rax]
vmovdqu YMMWORD PTR [rdi+rax], ymm0
It's using the AVX2 instructions available on Haswell to do 32 bytes at a time. If you compile with -march=skylake-avx512, clang uses 64-byte vxorps instructions on zmm registers (but gcc and icc stick with 32-byte inner loops). So in principle you can take some advantage of a new ISA simply with a recompile.
A downside of auto-vectorization is that it is fairly fragile. What auto-vectorizes on one compiler might not on another, or even on another version of the same compiler. So you need to check you are getting the results you want. The auto-vectorizer is often working with less information than you have: it might not know that the input length is a multiple of some power of two or that the input pointers are aligned in a certain way. Sometimes you can communicate this information to the compiler, but sometimes you can't.
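When you can communicate it, the usual tools are alignment hints and rounding the length; here's a sketch using the GCC/Clang extension __builtin_assume_aligned (whether the scalar prologue/epilogue actually disappears still depends on the compiler and version):
#include <cstdint>
#include <cstddef>

void otp_scramble_aligned(uint8_t* payload, uint8_t* key, size_t n) {
    // Promise 16-byte alignment and a length that's a multiple of 16.
    payload = static_cast<uint8_t*>(__builtin_assume_aligned(payload, 16));
    key     = static_cast<uint8_t*>(__builtin_assume_aligned(key, 16));
    n &= ~static_cast<size_t>(15);
    for (size_t i = 0; i < n; i++) {
        payload[i] ^= key[i];
    }
}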
Sometimes the compiler makes "interesting" decisions when it vectorizes, such as a small not-unrolled body for the inner loop, but then a giant "intro" or "outro" handling odd iterations, like what gcc produces after the first loop shown above:
movzx ecx, BYTE PTR [rsi+rax]
xor BYTE PTR [rdi+rax], cl
lea rcx, [rax+1]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+1+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+2]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+2+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+3]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+3+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+4]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+4+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+5]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+5+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+6]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+6+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+7]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+7+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+8]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+8+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+9]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+9+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+10]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+10+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+11]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+11+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+12]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+12+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+13]
cmp rdx, rcx
jbe .L1
movzx r8d, BYTE PTR [rsi+13+rax]
xor BYTE PTR [rdi+rcx], r8b
lea rcx, [rax+14]
cmp rdx, rcx
jbe .L1
movzx eax, BYTE PTR [rsi+14+rax]
xor BYTE PTR [rdi+rcx], al
You probably have better things to spend your instruction cache on (and this is far from the worst I've seen: it's easy to get examples with several hundreds of instructions in the intro and outro parts).
Unfortunately, the vectorizer probably won't produce crypto-specific instructions like carry-less multiply. You could consider a mix of scalar code that gets vectorized and an intrinsic only for the instructions the compiler won't generate, but this is easier to suggest than actually do successfully. At that point you are probably better off writing your entire loop with intrinsics.
1 The advantage of the gcc approach here is that at runtime if the platform supports popcnt this call can resolve to an implementation that just uses a popcnt instruction, using the GNU IFUNC mechanism.
2 Assuming the underlying assembler supports it, but even if it doesn't you could just encode the raw instruction bytes in the inline assembly block.
3 The call overhead includes more than just the explicit costs of the call and ret and argument passing: it also includes the effect on the optimizer which can't optimize code as well in the caller around the function call since it has unknown side-effects.
4 In some ways, auto-vectorization could be seen as a special case of idiom recognition, but it is important enough and has enough unique considerations that it gets its own section here.
5 With minor differences: gcc is as shown, clang unrolled a bit, and icc used a load-op pxor rather than a separate load.

Variable references in Intel style inline assembly and AT&T style, C++

I need to compile some assembly code in both Visual Studio and an IDE using G++ 4.6.1. The -masm=intel flag works as long as I do not reference or address any variables, which, however, I need to do.
I considered using intrinsics, but the compiled assembly is not optimal at all (for instance I cannot define the SSE register to be used and thus no pipe optimization is possible).
Consider these parts of the code (Intel-style assembly):
mov ecx, dword ptr [p_pXcoords]
mov edx, dword ptr [p_pYcoords]
movhpd xmm6, qword ptr [oAvgX]
movhpd xmm7, qword ptr [oAvgY]
movlpd xmm6, qword ptr [oAvgX]
movlpd xmm7, qword ptr [oAvgY]
where p_pXcoords and p_pYcoords are double* arrays and function parameters, and oAvgX and oAvgY are simple double values.
Another line of code is this, which is in the middle of an assembly block:
movhpd xmm6, qword ptr [oAvgY]
in other words, I need to access variables and use them within specific SSE registers in the middle of the code. How can I do this with AT&T syntax, and best of all: can I do this with a g++ compiler using the -masm flag?
Is there any way at all to use one assembly code for both VS and a g++ 4.6.1 based compiler?
You can certainly tell GCC which SSE register to use for each variable:
register __m128i x asm("xmm6");
But I guess VS does not support that. (I am also a little surprised you need it for decent performance. Register assignment and instruction scheduling are two of the most basic things an optimizing compiler knows. You sure you enabled optimization :-) ?)
I would probably just write two functions, one using intrinsics, and one using asm for whichever compiler does not know how to schedule instructions properly.
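For what it's worth, here's a hedged sketch of what the intrinsics version could look like for the snippet in the question (the movhpd/movlpd pair loading from one address is just a broadcast, which SSE3's _mm_loaddup_pd does directly; the compiler, not you, picks the physical xmm registers):
#include <emmintrin.h>   // SSE2
#include <pmmintrin.h>   // SSE3: _mm_loaddup_pd

void example(const double* p_pXcoords, const double* p_pYcoords,
             double oAvgX, double oAvgY) {
    __m128d avgX = _mm_loaddup_pd(&oAvgX);    // { oAvgX, oAvgX }
    __m128d avgY = _mm_loaddup_pd(&oAvgY);    // { oAvgY, oAvgY }
    __m128d x    = _mm_loadu_pd(p_pXcoords);  // two x coordinates
    __m128d y    = _mm_loadu_pd(p_pYcoords);  // two y coordinates
    __m128d dx   = _mm_sub_pd(x, avgX);
    __m128d dy   = _mm_sub_pd(y, avgY);
    // ... the rest of the computation continues with _mm_* intrinsics
    (void)dx; (void)dy;
}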