uint64 array to uint128 for SSE2 - c++

I have two similar issues when handling arrays: one when the array is defined in the asm file, and one when it is passed from C++ to asm. The code works fine as inline assembly, but I need to move it out of the .cpp into a separate .asm file. The assembler doesn't throw an error or warning, but the end result is random on each run, whereas it was constant when the code was inline.
The code below works with MMX (movq mm6, twosMask_W), but I need the equivalent for SSE2. I thought that this would work, but I appear to be incorrect.
.data
align 16
twosMask_W qword 2 dup(0002000200020002h)
.code
...
movdqa xmm6,oword ptr twosMask_W
...
The second issue is when I pass my thresh128 array from C++ to asm (again for SSE2):
//C++
uint64_t thresh128[2];
thresh128[0] = ((thresh-1)<<8)+(thresh-1);
thresh128[0] += (thresh128[0]<<48)+(thresh128[0]<<32)+(thresh128[0]<<16);
thresh128[1] = thresh128[0];
sendToASM(thresh128);
//ASM
; There are more parameters that use the registers, but they are not listed here.
receivedFromCPP proc thresh:qword
public receivedFromCPP
...
movdqu xmm4,oword ptr thresh
...
I've tried declaring thresh as an oword parameter in the procedure, but that made no difference. I'm sure I've got some syntax or parameter type wrong. Any help would be greatly appreciated.
Note: Compiled using MASM in VS2013 for x86.

Well, I tested the first part and it seems to work, so I can't say anything about that particular issue.
Concerning the second problem: you seem to be passing a 64-bit qword on the stack in 32-bit mode (where there is no direct opcode for a 64-bit PUSH), so it would take two PUSHes...
receivedFromCPP proc thresh:qword
but then you are expecting a full 128-bit value at that location on the stack:
movdqu xmm4,oword ptr thresh
Also keep in mind the little-endianness of x86: depending on how the compiler chooses to push the 2*64-bit array, the layout may differ from a single little-endian 128-bit value, resulting in seemingly random values.
EDIT: Because the stack grows downward, a 128-bit value has to be PUSHed in reverse order (high half first) so it can be referenced through EBP.
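One way around this (a sketch of my suggestion, not the OP's code; the names sendToASM_byPointer and pThresh are mine, and the MASM lines in the comment are only an outline): pass the address of the 16-byte array, so only a 4-byte pointer travels across the call, and do the 128-bit load through that pointer on the asm side:
#include <stdint.h>

// The asm side would then declare something like
//     receivedFromCPP proc pThresh:ptr
// and load through the pointer:
//     mov eax, pThresh
//     movdqu xmm4, oword ptr [eax]
extern "C" void receivedFromCPP(const uint64_t *pThresh);

void sendToASM_byPointer(uint64_t thresh)
{
    uint64_t thresh128[2];
    thresh128[0] = ((thresh - 1) << 8) + (thresh - 1);
    thresh128[0] += (thresh128[0] << 48) + (thresh128[0] << 32) + (thresh128[0] << 16);
    thresh128[1] = thresh128[0];
    receivedFromCPP(thresh128);   // the array decays to a pointer, so only 4 bytes are passed
}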

Related

GCC w/ inline assembly & -Ofast generating extra code for memory operand

I am passing the address of a table entry (indexed at runtime) as a memory operand to an extended inline assembly statement, but GCC is producing an extra lea instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from an 8-byte table, with the bits as the index.
When I use the intrinsic, GCC produces the exactly same code as using the extended inline assembly.
I can directly embed the memory operation into the ASM template, but that would force RIP-relative addressing always (and I don't like forcing myself into workarounds).
#include <immintrin.h>   // _mm_cvtepi8_epi64 (pmovsxbq) and the __m128i type
#include <stdint.h>
#include <assert.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));

// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

xmm2q mask2b(uint64_t mask) {
    assert(mask < 4);
#ifdef USE_ASM
    xmm2q result;
    asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
    return result;
#else
    // bad cast (UB?), but input should be `uint16_t*` anyways
    return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S (with USE_ASM and without):
__Z6mask2by: ## #_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq
The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.
(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP relative, the longer SIB form as [sign_extended_disp32]).
If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.
In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.
Intrinsics
Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.
*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load, but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq, which has a 2-byte memory operand and therefore doesn't require alignment.
In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.
The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extend from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load even though you really want 16. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
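For completeness, a self-contained sketch of that safer pattern (my code, with the function name mask2b_safe invented here, and the same caveat as above about hoping the compiler folds the movd into a load):
#include <immintrin.h>
#include <stdint.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };   // same table as the question

static inline xmm2q mask2b_safe(uint64_t mask) {
    __m128i v = _mm_cvtsi32_si128(MASK_TABLE[mask]);   // scalar load + movd, no 16-byte access implied
    return (xmm2q) _mm_cvtepi8_epi64(v);               // pmovsxbq; ideally folded into one load-and-extend
}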
I've written about this before:
Loading 8 chars from memory into an __m256 variable as packed single precision floats
How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
Your integer -> vector strategy
Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.
A lookup table does work here, with only a small amount of static data needed. If your mask range were any bigger, you'd maybe want to look into Is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).
If you do this often, using a whole cache line of 4x 16-byte vector constants might be best so you don't need a pmovzx instruction at all: just index into an aligned table of xmm2q or __m128i vectors, which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
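A rough sketch of that table (my code; MASK_TABLE128 and mask2b_table are names invented here, and the element values are chosen to reproduce exactly what the question's pmovsxbq version produces):
#include <stdint.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));

// 4 pre-expanded 16-byte masks in one 64-byte-aligned block; the values are the
// sign-extended results pmovsxbq would produce from the 0x00/0x80 table bytes.
alignas(64) static const xmm2q MASK_TABLE128[4] = {
    { 0, 0 },
    { 0xFFFFFFFFFFFFFF80ULL, 0 },
    { 0, 0xFFFFFFFFFFFFFF80ULL },
    { 0xFFFFFFFFFFFFFF80ULL, 0xFFFFFFFFFFFFFF80ULL },
};

static inline xmm2q mask2b_table(uint64_t mask) {
    return MASK_TABLE128[mask];   // one aligned 16-byte load; also usable as a memory source operand
}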
You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).
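And a sketch of the pdep idea (my code, untested against the original use case; mask2b_pdep is a name invented here, it needs -mbmi2 -msse4.1, and it keeps the sign-extension so the result is identical to the question's version):
#include <immintrin.h>
#include <stdint.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));

static inline xmm2q mask2b_pdep(uint64_t mask) {
    // Deposit bit 0 of mask into bit 7 and bit 1 into bit 15: gives 0, 0x0080, 0x8000 or 0x8080.
    uint32_t bytes = _pdep_u32((uint32_t)mask, 0x8080u);
    __m128i v = _mm_cvtsi32_si128((int)bytes);   // movd eax -> xmm
    return (xmm2q) _mm_cvtepi8_epi64(v);         // pmovsxbq xmm, xmm (reg-reg, no load)
}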

How to get efficient asm for zeroing a tiny struct with MSVC++ for x86-32?

My project is compiled for 32-bit in both Windows and Linux. I have an 8-byte struct that's used just about everywhere:
struct Value {
    unsigned char type;
    union {             // 4 bytes
        unsigned long ref;
        float num;
    };
};
In a lot of places I need to zero out the struct, which is done like so:
#define NULL_VALUE_LITERAL {0, {0L}};
static const Value NULL_VALUE = NULL_VALUE_LITERAL;
// example of clearing a value
var = NULL_VALUE;
This however does not compile to the most efficient code in Visual Studio 2013, even with all optimizations on. What I see in the assembly is that the memory location for NULL_VALUE is being read, then written to the var. This results in two reads from memory and two writes to memory. This clearing however happens a lot, even in routines that are time-sensitive, and I'm looking to optimize.
If I set the value to NULL_VALUE_LITERAL, it's worse. The literal data, which again is all zeroes, is copied into a temporary stack value and THEN copied to the variable, even if the variable is also on the stack. So that's absurd.
There's also a common situation like this:
*pd->v1 = NULL_VALUE;
It has similar assembly code to the var=NULL_VALUE above, but it's something I can't optimize with inline assembly should I choose to go that route.
From my research the very, very fastest way to clear the memory would be something like this:
xor eax, eax
mov byte ptr [var], al
mov dword ptr [var+4], eax
Or better still, since the struct alignment means there's just junk for 3 bytes after the data type:
xor eax, eax
mov dword ptr [var], eax
mov dword ptr [var+4], eax
Can you think of any way I can get code similar to that, optimized to avoid the memory reads that are totally unnecessary?
I tried some other methods, which end up creating what I feel is overly bloated code writing a 32-bit 0 literal to the two addresses, but IIRC writing a literal to memory still isn't as fast as writing a register to memory. I'm looking to eke out any extra performance I can get.
Ideally I would also like the result to be highly readable. Your help is appreciated.
I'd recommend uint32_t or unsigned int for the union with float. long on Linux x86-64 is a 64-bit type, which is probably not what you want.
I can reproduce the missed-optimization with MSVC CL19 -Ox on the Godbolt compiler explorer for x86-32 and x86-64. Workarounds that work with CL19:
make type an unsigned int instead of char, so there's no padding in the struct, then assign from a literal {0, {0L}} instead of a static const Value object. (Then you get two mov-immediate stores: mov DWORD PTR [eax], 0 / mov DWORD PTR [eax+4], 0).
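Roughly what that first workaround looks like (my paraphrase, with the shortened function name clear invented here; the codegen claim is the one quoted above rather than something I've re-verified):
struct Value {
    unsigned int type;          // widened from unsigned char, so there's no padding
    union {
        unsigned int ref;       // was unsigned long (same size in a 32-bit build)
        float num;
    };
};

void clear(Value *v) {
    *v = Value{0, {0u}};        // assign from a literal, not from a static const object
}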
gcc also has struct-zeroing missed-optimizations with padding in structs, but not as bad as MSVC (Bug 82142). It just defeats merging into wider stores; it doesn't get gcc to create an object on the stack and copy from that.
std::memset: probably the best option, MSVC compiles it to a single 64-bit store using SSE2. xorps xmm0, xmm0 / movq QWORD PTR [mem], xmm0. (gcc -m32 -O3 compiles this memset to two mov-immediate stores.)
void arg_memset(Value *vp) {
    memset(vp, 0, sizeof(*vp));   // needs <cstring>; *vp is the 8-byte struct
}
;; x86 (32-bit) MSVC -Ox
mov eax, DWORD PTR _vp$[esp-4]
xorps xmm0, xmm0
movq QWORD PTR [eax], xmm0
ret 0
This is what I'd choose for modern CPUs (Intel and AMD). The penalty for crossing a cache-line is low enough that it's worth saving an instruction if it doesn't happen all the time. xor-zeroing is extremely cheap (especially on Intel SnB-family).
IIRC writing a literal to memory still isn't as fast as writing a register to memory
In asm, constants embedded in the instruction are called immediate data. mov-immediate to memory is mostly fine on x86, but it's a bit bloated for code-size.
(x86-64 only): A store with a RIP-relative addressing mode and an immediate can't micro-fuse on Intel CPUs, so it's 2 fused-domain uops. (See Agner Fog's microarch pdf, and other links in the x86 tag wiki.) This means it's worth it (for front-end bandwidth) to zero a register if you're doing more than one store to a RIP-relative address. Other addressing modes do fuse, though, so it's just a code-size issue.
Related: Micro fusion and addressing modes (indexed addressing modes un-laminate on Sandybridge/Ivybridge, but Haswell and later can keep indexed stores micro-fused.) This isn't dependent on immediate vs. register source.
I think memset would be a very poor fit since this is just an 8-byte struct.
Modern compilers know what some heavily-used / important standard library functions do (memset, memcpy, etc.), and treat them like intrinsics. There's very little difference as far as optimization is concerned between a = b and memcpy(&a, &b, sizeof(a)) if they have the same type.
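If you want to see that equivalence for yourself, a small test pair (hypothetical names; Pair is just a stand-in 8-byte struct like the one in the question):
#include <cstring>

struct Pair { unsigned int type; unsigned int ref; };   // stand-in 8-byte struct

void copy_assign(Pair &dst, const Pair &src) { dst = src; }
void copy_memcpy(Pair &dst, const Pair &src) { std::memcpy(&dst, &src, sizeof dst); }
// With optimization enabled, both typically compile to the same 8-byte copy.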
You might get a function call to the actual library implementation in debug mode, but debug mode is very slow anyway. If you have debug-mode perf requirements, that's unusual. (But does happen for code that needs to keep up with something else...)

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM with VC++ on x86, and I'm learning how to use MASM on x64.
Is there any way to specify where each argument of an assembly function goes, and where its return value ends up, in such a way that the compiler arranges to have the data already in those places in the fastest way? Could we also specify which registers will be used, so that the compiler knows which data is still preserved and can make the best use of it?
For example, since there is no intrinsic function that maps exactly to IDIV r/m64 (the 64-bit signed integer division instruction), we may need to implement it ourselves. IDIV requires the low part of the dividend/numerator in RAX, the high part in RDX, and the divisor/denominator in any register or memory location. Afterwards, the quotient is in RAX and the remainder in RDX. We might therefore want to write functions like this (I've included some pointless operands just to illustrate the idea):
void DivLongLongInt( long long NumLow, long long NumHigh, long long Den,
                     long long *Quo, long long *Rem ) {
    __asm(
        // Specify used register: [rax], specify pre location: NumLow --> [rax]
        reg(rax) = NumLow,
        // Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
        reg(rdx) = NumHigh,
        // Specify required memory: memory64bits [den], specify pre location: Den --> [den]
        mem[64](den) = Den,
        // Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
        reg(st0) = 25 * 0.5,
        // Specify used register: [bh]
        reg(bh),
        // Specify required memory: memory64bits [nothing]
        mem[64](nothing),
        // Specify used register: [st1]
        reg(st1)
    ) {
        // Specify code
        IDIV [den]
    } (
        // Specify post location: [rax] --> *Quo
        *Quo = reg(rax),
        // Specify post location: [rdx] --> *Rem
        *Rem = reg(rdx)
    );
}
Is it possible to do something at least close to that?
Thanks for any help.
If there is no way to do this, it's a shame, because it would certainly be a great way to implement high-level functions with assembly-level features. I think this kind of simple interface between C++ and ASM should already exist, enabling assembly code to be embedded inline and at a high level, practically as simple C++ code.
As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
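As a small illustration of the reciprocal trick (the multiplier below is the standard one for unsigned division by 3; the function names are mine and the comment describes typical codegen, not output I've pasted here):
#include <stdint.h>

uint32_t div3(uint32_t x) {
    return x / 3;     // optimizing compilers emit a multiply and a shift here, not a div
}

uint32_t div3_by_hand(uint32_t x) {
    // multiplier = ceil(2^33 / 3) = 0xAAAAAAAB; (x * multiplier) >> 33 == x / 3 for every 32-bit x
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}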
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
    mov rax, rcx        ; RCX holds the low half of the dividend
    idiv r8             ; 128-bit divide (RDX:RAX / R8)
    mov [r9], rdx       ; store remainder
    ret                 ; return, with the quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern "C" int64_t Div128x64(int64_t loDividend,
                             int64_t hiDividend,
                             int64_t divisor,
                             int64_t* pRemainder);
and you're done. Call it as desired.
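For example, a minimal usage sketch (my example values; it assumes the MASM module above is assembled and linked into the project, and the function name Example is mine):
#include <stdint.h>

extern "C" int64_t Div128x64(int64_t loDividend, int64_t hiDividend,
                             int64_t divisor, int64_t* pRemainder);

int64_t Example()
{
    int64_t remainder;
    // Divide the 128-bit value 2^64 (high half 1, low half 0) by 3.
    int64_t quotient = Div128x64(0, 1, 3, &remainder);
    // quotient == 0x5555555555555555, remainder == 1 (the quotient fits in 64 bits here)
    return quotient + remainder;
}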
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting a mov into the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter too much.

Is there a more direct method to convert float to int with rounding than adding 0.5f and converting with truncation?

Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables.
Consider this snippet of code:
// Convert a positive float value and round to the nearest integer
int RoundedIntValue = (int) (FloatValue + 0.5f);
The C/C++ language defines the (int) cast as truncating, so the 0.5f must be added to get rounding to the nearest integer (when the input is positive). For the above, VS2015's compiler generates the following code:
movss xmm9, DWORD PTR __real#3f000000 // 0.5f
addss xmm0, xmm9
cvttss2si eax, xmm0
The above works, but could be more efficient...
Intel's designers apparently thought it was important enough a problem to solve with a single instruction that will do just what's needed: Convert to the nearest integer value: cvtss2si (note, just one 't' in the mnemonic).
If cvtss2si were to replace the cvttss2si instruction in the above sequence, two of the three instructions would simply be eliminated (as would the use of an extra xmm register, which could result in better optimization overall).
So how can we code C++ statement(s) to get this simple job done with the one cvtss2si instruction?
I've been poking around, trying things like the following but even with the optimizer on task it doesn't boil down to the one machine instruction that could/should do the job:
int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue));
Unfortunately the above seems bent on clearing out a whole vector of registers that will never be used, instead of just using the one 32 bit value.
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
Perhaps I'm missing an obvious approach here.
Can you offer a suggested set of C++ instructions that will ultimately generate the single cvtss2si instruction?
This is an optimization defect in Microsoft's compiler, and the bug has been reported to Microsoft. As other commentators have mentioned, modern versions of GCC, Clang, and ICC all produce the expected code. For a function like:
int RoundToNearestEven(float value)
{
    return _mm_cvt_ss2si(_mm_set_ss(value));
}
all compilers but Microsoft's will emit the following object code:
cvtss2si eax, xmm0
ret
whereas Microsoft's compiler (as of VS 2015 Update 3) emits the following:
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
ret
The same is seen for the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).
Until such time as the optimizer is improved, there is no faster alternative available. Fortunately, the code currently being generated is not as slow as it might seem. Moving and register-clearing are among the fastest possible instructions, and several of these can probably be implemented solely in the front end as register renames. And it is certainly faster than any of the possible alternatives—often by orders of magnitude:
The trick of adding 0.5 that you mentioned will not only be slower because it has to load a constant and perform an addition, it will also not produce the correctly rounded result in all cases (see the example just after this list of alternatives).
Using the _mm_load_ss intrinsic to load the floating-point value into an __m128 structure suitable to be used with the _mm_cvt_ss2si intrinsic is a pessimization because it causes a spill to memory, rather than just a register-to-register move.
(Note that while _mm_set_ss is always better for x86-64, where the calling convention uses SSE registers to pass floating-point values, I have occasionally observed that _mm_load_ss will produce more optimal code in x86-32 builds than _mm_set_ss, but it is highly dependent upon multiple factors and has only been observed when multiple intrinsics are used in a complicated sequence of code. Your default choice should be _mm_set_ss.)
Substituting a reinterpret_cast<__m128&>(value) (or moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It results in a spill from the SSE register to memory; the cvtss2si instruction then uses that memory location as its source operand.
Declaring a temporary __m128 structure and value-initializing it is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and then each slot is filled with either 0 or the floating-point value. This structure's memory location is then used as the source operand for cvtss2si.
The lrint family of functions provided by the C standard library should do what you want, and in fact compile to straightforward cvt* instructions on some other compilers, but are extremely sub-optimal on Microsoft's compiler. They are never inlined, so you always pay the cost of a function call. Plus, the code inside of the function is sub-optimal. Both of these have been reported as bugs, but we are still awaiting a fix. There are similar problems with other conversion functions provided by the standard library, including lround and friends.
The x87 FPU offers a FIST/FISTP instruction that performs a similar task, but the C and C++ language standards require that a cast truncate, rather than round-to-nearest-even (the default FPU rounding mode), so the compiler is obligated to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there's no way to instruct the compiler not to do it except by using inline assembly. Beyond the fact that inline assembly is not available with the 64-bit compiler, MSVC's inline assembly syntax also offers no way to specify inputs and outputs, so you pay double load and store penalties both ways. And even if this weren't the case, you'd still have to pay the cost of copying the floating-point value from an SSE register, into memory, and then onto the x87 FPU stack.
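To make that first point about the add-0.5 trick concrete, here is a small counterexample (my example, using the illustrative value 0.49999997f; easy to check by hand):
#include <cstdio>
#include <xmmintrin.h>

int main()
{
    float f = 0.49999997f;                         // the largest float below 0.5
    int viaTrick = (int)(f + 0.5f);                // f + 0.5f rounds up to exactly 1.0f, so this gives 1
    int viaCvt   = _mm_cvt_ss2si(_mm_set_ss(f));   // cvtss2si rounds to nearest, giving 0
    std::printf("%d %d\n", viaTrick, viaCvt);      // prints "1 0"
    return 0;
}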
Intrinsics are great, and can often allow you to produce code that is faster than what would otherwise be generated by the compiler, but they are not perfect. If you're like me and find yourself frequently analyzing the disassembly for your binaries, you will find yourself frequently disappointed. Nevertheless, your best choice here is to use the intrinsic.
As for why the optimizer emits the code in the way that it does, I can only speculate since I don't work on the Microsoft compiler team, but my guess would be because a number of the other cvt* instructions have false dependencies that the code-generator needs to work around. For example, a cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we don't actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction has completed, latency is vastly increased. However, by executing an xorps or movss instruction first, the register's upper bits are cleared, thus breaking dependencies and avoiding the possibility for a stall. This is an example of an interesting case where shorter code does not equate to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably applied the heuristic overzealously.
Visual Studio 15.6, released today, appears to finally correct this issue. We now see a single instruction used when inlining this function:
inline int ConvertFloatToRoundedInt(float FloatValue)
{
    return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
}
I'm impressed that Microsoft finally got a round tuit.

Micro optimize pointer + unsigned + 1

Hard as it may be to believe, the construct p[u+1] occurs in several places in the innermost loops of code I maintain, such that getting the micro-optimization of it right makes hours of difference in an operation that runs for days.
Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely *((p+1)+u) is best. (But usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, and can't convert *(p+(u+1)) with either of the others).
p is a pointer and u is an unsigned. In the actual code at least one of them (more likely both) will already be in register(s) at the point the expression is evaluated. Those facts are critical to the point of my question.
In 32-bit (before my project dropped support for that) all three have exactly the same semantics and any half decent compiler simply picks the best of the three and the programmer never needs to care.
In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't know. So far as the compiler knows, the decision of when to extend u from 32-bit to 64-bit can affect the result.
What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?
In one Linux 64-bit compiler, I got nearly there with p[u+1L] which causes the compiler to select intelligently between the usually best *((p+u)+1) and the sometimes better *(p+( (long)(u) + 1) ). In the rare case *(p+(u+1)) was still better than the second of those, a little is lost.
Obviously, that does no good in 64-bit Windows. Now that we dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?
Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but create a larger performance problem nearby. Casting u to std::size_t right there is almost good enough, and maybe the best I can do. But that is pretty verbose for an imperfect solution.
Simply coding (p+1)[u] makes a selection more likely to be optimal than p[u+1]. If the code were less templated and more stable, I could set them all to (p+1)[u] then profile then switch a few back to p[u+1]. But the templating tends to destroy that approach (A single source line appears in many places in the profile adding up to serious time, but not individually serious time).
Compilers that should be efficient for this are GCC, ICC and MSVC.
The answer is inevitably compiler and target specific, but even if 1ULL is wider than a pointer on whatever target architecture, a good compiler should optimize it away. Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? explains why a wider computation truncated to pointer width will give identical results as doing computation with pointer width in the first place. This is why compilers can optimize it away even on 32bit machines (or x86-64 with the x32 ABI) when 1ULL leads to promotion of the + operands to a 64bit type. (Or on some 64bit ABI for some architecture where long long is 128b).
1ULL looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in the return p[u + 1ULL];. All the other cases are compiled to a single load with scaled-index+4+p addressing mode. So other than one compiler's optimization failure, 1ULL looks fine for 32bit as well. (I think it's unlikely that it's a clang bug and that optimization is illegal).
int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
// ... load u from the stack
// add eax, 1
// mov eax, DWORD PTR p[0+eax*4]
instead of
mov eax, DWORD PTR p[4+eax*4]
Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32-bit pointers and a register-call ABI similar to SysV AMD64). It uses a 32-bit address-size prefix to avoid using the upper 32 bits of rdi.
Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:
int x2 (char *&c) { return *c; }
// mov eax, DWORD PTR [edi] ; upper32 of rax is zero
// movsx eax, BYTE PTR [eax] ; could be byte [rax], saving one byte of machine code
Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.
Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.
The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.
If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.
int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
// mov eax, DWORD PTR [rdi]
// mov eax, DWORD PTR p[4+rax*4]
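Finally, putting the variants discussed above side by side makes a convenient file to paste into Godbolt (the function names here are hypothetical, and it assumes a global array p as in the earlier examples):
#include <cstdint>
#include <cstddef>

extern int p[];

int v_plain  (std::uint32_t u) { return p[u + 1]; }               // baseline
int v_1ull   (std::uint32_t u) { return p[u + 1ULL]; }            // widen the constant
int v_sizet  (std::uint32_t u) { return p[(std::size_t)u + 1]; }  // widen u explicitly
int v_swapped(std::uint32_t u) { return (p + 1)[u]; }             // *((p+1)+u)
// Compile each with -O2/-O3 (or /O2 for MSVC) and compare the addressing modes.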