Optimized code in VC++ and ASM

Optimized code in VC++ and ASM - c++

Good evening. Sorry, I used google tradutor.
I use NASM in VC ++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument goes and the return of an assembly function in such a way that the compiler manages to leave the data there in the fastest way? We can too specify which registers will be used so that the compiler knows what data is still saved to make the best use of it?
For example, since there is no intrinsic function that applies the exactly IDIV r/m64 (64-bit signed integer division of assembly language), we may need to implement it. The IDIV requires that the low magnitude part of the dividend/numerator be in RAX, the high in RDX and the divisor/denominator in any register or in a region of memory. At the end, the quotient is in EAX and the remainder in EDX. We may therefore want to develop functions so (I put inutilities to exemplify):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame because it would certainly be a great way to implement high-level functions with assembly-level features. I think it's a simple interface between C ++ and ASM that should already exist and enable assembly code to be embedded inline and at high level, practically as simple C++ code.

As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
mov rax, rcx
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RDX:RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.

No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put inputs in the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting mov into a code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter too much.

Related

Fastest way to compare a double to exact 0 while both +0.0 or -0.0 are accepted

So far I have the following:
bool IsZero(const double x) {
return fabs(x) == +0.0;
}
Is this the fastest of correct ways to compare to exact 0, while both +0.0 and -0.0 are accepted?
If CPU-specific, lets consider x86-64. If compiler specific, lets consider MSVC++2017 toolset v141.

Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is zero. Actually, the value is +0.0. The sign bit can be either 0 or 1, since the representation allows such thing as positive and negative 0.0, as you mention in the question. But this difference doesn't actually exist (there's not really any such thing as +0.0 and −0.0), so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. On little-endian architectures like x86, the sign bit is the leading bit, so you simply shift it out and then test the remaining bits.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32-bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));
return ((bits + bits) == 0);
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just do value == 0.0. The exact sequence of code that they emit varies, depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE or CMPEQSS+MOVD+NEG (the latter is used exclusively by ICC). Your attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then a value of 0 corresponds to the case where the exponent bits are all 0. The upper bits are the sign bit and the exponent bits, so you can just test these exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In unsafe C:
const uint64_t bits = *(reinterpret_cast<uint64_t*>(&value);
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000) >> 32;
return ((upperBits + upperBits) == 0);
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.

In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the cmath library you can use:
int fpclassify( double arg ) which will return "zero" for -0.0 or +0.0

If you open the assembler of the code you can find what kind of assembler instructions are used for different versions of your code. Having the assembler you can estimate which is better.
In GCC compiler you can keep intermediate files (including assembler version) by this way:
gcc -save-temps main.cpp

Is there a more direct method to convert float to int with rounding than adding 0.5f and converting with truncation?

Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables.
Consider this snippet of code:
// Convert a positive float value and round to the nearest integer
int RoundedIntValue = (int) (FloatValue + 0.5f);
The C/C++ language defines the (int) cast as truncating, so the 0.5f must be added to ensure rounding up to the nearest positive integer (when the input is positive). For the above, VS2015's compiler generates the following code:
movss xmm9, DWORD PTR __real#3f000000 // 0.5f
addss xmm0, xmm9
cvttss2si eax, xmm0
The above works, but could be more efficient...
Intel's designers apparently thought it was important enough a problem to solve with a single instruction that will do just what's needed: Convert to the nearest integer value: cvtss2si (note, just one 't' in the mnemonic).
If the cvtss2si were to replace the cvttss2si instruction in the above sequence two of the three instructions would just be eliminated (as would the use of an extra xmm register, which could result in better optimization overall).
So how can we code C++ statement(s) to get this simple job done with the one cvtss2si instruction?
I've been poking around, trying things like the following but even with the optimizer on task it doesn't boil down to the one machine instruction that could/should do the job:
int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue));
Unfortunately the above seems bent on clearing out a whole vector of registers that will never be used, instead of just using the one 32 bit value.
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
Perhaps I'm missing an obvious approach here.
Can you offer a suggested set of C++ instructions that will ultimately generate the single cvtss2si instruction?

This is an optimization defect in Microsoft's compiler, and the bug has been reported to Microsoft. As other commentators have mentioned, modern versions of GCC, Clang, and ICC all produce the expected code. For a function like:
int RoundToNearestEven(float value)
{
return _mm_cvt_ss2si(_mm_set_ss(value));
}
all compilers but Microsoft's will emit the following object code:
cvtss2si eax, xmm0
ret
whereas Microsoft's compiler (as of VS 2015 Update 3) emits the following:
movaps xmm1, xmm0
xorps xmm2, xmm2
movss xmm2, xmm1
cvtss2si eax, xmm2
ret
The same is seen for the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).
Until such time as the optimizer is improved, there is no faster alternative available. Fortunately, the code currently being generated is not as slow as it might seem. Moving and register-clearing are among the fastest possible instructions, and several of these can probably be implemented solely in the front end as register renames. And it is certainly faster than any of the possible alternatives—often by orders of magnitude:
The trick of adding 0.5 that you mentioned will not only be slower because it has to load a constant and perform an addition, it will also not produce the correctly rounded result in all cases.
Using the _mm_load_ss intrinsic to load the floating-point value into an __m128 structure suitable to be used with the _mm_cvt_ss2si intrinsic is a pessimization because it causes a spill to memory, rather than just a register-to-register move.
(Note that while _mm_set_ss is always better for x86-64, where the calling convention uses SSE registers to pass floating-point values, I have occasionally observed that _mm_load_ss will produce more optimal code in x86-32 builds than _mm_set_ss, but it is highly dependent upon multiple factors and has only been observed when multiple intrinsics are used in a complicated sequence of code. Your default choice should be _mm_set_ss.)
Substituting a reinterpret_cast<__m128&>(value) (or moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It results in a spill from the SSE register to memory; the cvtss2si instruction then uses that memory location as its source operand.
Declaring a temporary __m128 structure and value-initializing it is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and then each slot is filled with either 0 or the floating-point value. This structure's memory location is then used as the source operand for cvtss2si.
The lrint family of functions provided by the C standard library should do what you want, and in fact compile to straightforward cvt* instructions on some other compilers, but are extremely sub-optimal on Microsoft's compiler. They are never inlined, so you always pay the cost of a function call. Plus, the code inside of the function is sub-optimal. Both of these have been reported as bugs, but we are still awaiting a fix. There are similar problems with other conversion functions provided by the standard library, including lround and friends.
The x87 FPU offers a FIST/FISTP instruction that performs a similar task, but the C and C++ language standards require that a cast truncate, rather than round-to-nearest-even (the default FPU rounding mode), so the compiler is obligated to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there's no way to instruct the compiler not to do it except by using inline assembly. Beyond the fact that inline assembly is not available with the 64-bit compiler, MSVC's inline assembly syntax also offers no way to specify inputs and outputs, so you pay double load and store penalties both ways. And even if this weren't the case, you'd still have to pay the cost of copying the floating-point value from an SSE register, into memory, and then onto the x87 FPU stack.
Intrinsics are great, and can often allow you to produce code that is faster than what would otherwise be generated by the compiler, but they are not perfect. If you're like me and find yourself frequently analyzing the disassembly for your binaries, you will find yourself frequently disappointed. Nevertheless, your best choice here is to use the intrinsic.
As for why the optimizer emits the code in the way that it does, I can only speculate since I don't work on the Microsoft compiler team, but my guess would be because a number of the other cvt* instructions have false dependencies that the code-generator needs to work around. For example, a cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we don't actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction has completed, latency is vastly increased. However, by executing an xorss or movss instruction first, the register's upper bits are cleared, thus breaking dependencies and avoiding the possibility for a stall. This is an example of an interesting case where shorter code does not equate to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably applied the heuristic overzealously.

Visual Studio 15.6, released today, appears to finally correct this issue. We now see a single instruction used when inlining this function:
inline int ConvertFloatToRoundedInt(float FloatValue)
{
return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
}
I'm impressed that Microsoft finally got a round tuit.

Can a movss instruction be used to replace integer data?

With the constraint that I can use only SSE and SSE2 instructions, I have a need to replace the least significant (0) element of a 4 element __m128i vector with the 0 element from another vector.
For floating point vectors, the task is simple - one can use the _mm_move_ss() intrinsic to cause the element to be replaced with the 0 element from another vector. It generates one movss instruction, so is quite efficient.
Using two casting intrinsics, it's possible to also convince the compiler to use a single SSE movss instruction to move integer data. The source code ends up looking like this:
__m128i NewVector = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(Take3FromThisVector),
_mm_castsi128_ps(Take1FromThisVector)));
It looks a bit messy, but with a suitable amount of commenting it can be acceptable, especially since it generates a minimum of instructions. In its typical use everything's optimized to be in xmm registers.
My question is this:
Since it's a movss instruction, where the "ss" implies single precision floating point, is it okay to have it move integer data that could potentially contain some "special" or "illegal" (for floating point) combo of bits in any of the vector positions?
The obvious alternative - which I also implemented and tested - is to AND the first vector with a mask, then OR in a second vector that contains just one value in the least significant element, with all the others being zero. As you can imagine, this generates more instructions.
I've tested the casting approach I showed above and it doesn't seem to cause any problems, but I note in particular that there's no intrinsic provided that does this same operation for integer data. It seems as though Intel would have provided one if it was just as good for integer data - e.g., _mm_move_epi32 or similar. And so I'm skeptical whether this is a good idea.
I did some searches, e.g., "can a movss instruction cause a floating point exception", but did not find any information that would answer my question.
Thanks in advance for knowledge you're willing to share.
-Noel

Yes, it's fine to use FP shuffles like movss xmm, xmm on integer data. The insn reference manual tells you that it can't raise FP numeric exceptions; only actual FP math instructions do that. So go ahead and cast.
There isn't even a bypass delay for using FP shuffles on integer data in most uarches (but there is extra latency for using integer shuffles between FP math instructions).
Agner Fog's "optimizing assembly" guide has a great section on what instructions are useful for different kinds of data movement (broadcasts, merging, etc.) See also the x86 tag wiki for more good links.
The reason there's no integer intrinsic is that the SSE2 movd integer instruction zeros the upper bytes of the destination, like movss used as a load, but unlike movss between registers.
Intel's vector instruction set known for its inconsistency and non-orthogonality, esp. the earliest versions (like SSE1). SSE4.1 filled many gaps, but there are still obvious missing pieces.

The types __m128 and __m128i are interchangeable. The main reason for the cast is to make your intentions clearer (and keep your compiler happy). The cast itself does not generate any extra assembly.
The _mm_move_ss operation is described directly in terms of which bits end up in your result.
If you end up with an invalid bit combination for single-precision floats, this will only be a problem if you try to use the resulting value in floating-point calculations.

Micro optimize pointer + unsigned + 1

Hard as it may be to believe the construct p[u+1] occurs in several places in innermost loops of code I maintain such that getting the micro optimization of it right makes hours of difference in an operation that runs for days.
Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely *((p+1)+u) is best. (But usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, and can't convert *(p+(u+1)) with either of the others).
p is a pointer and u is an unsigned. In the actual code at least one of them (more likely both) will already be in register(s) at the point the expression is evaluated. Those facts are critical to the point of my question.
In 32-bit (before my project dropped support for that) all three have exactly the same semantics and any half decent compiler simply picks the best of the three and the programmer never needs to care.
In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't know. So far as the compiler knows, the decision of when to extend u from 32-bit to 64-bit can affect the result.
What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?
In one Linux 64-bit compiler, I got nearly there with p[u+1L] which causes the compiler to select intelligently between the usually best *((p+u)+1) and the sometimes better *(p+( (long)(u) + 1) ). In the rare case *(p+(u+1)) was still better than the second of those, a little is lost.
Obviously, that does no good in 64-bit Windows. Now that we dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?
Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but create a larger performance problem nearby. Casting u to std::size_t right there is almost good enough, and maybe the best I can do. But that is pretty verbose for an imperfect solution.
Simply coding (p+1)[u] makes a selection more likely to be optimal than p[u+1]. If the code were less templated and more stable, I could set them all to (p+1)[u] then profile then switch a few back to p[u+1]. But the templating tends to destroy that approach (A single source line appears in many places in the profile adding up to serious time, but not individually serious time).
Compilers that should be efficient for this are GCC, ICC and MSVC.

The answer is inevitably compiler and target specific, but even if 1ULL is wider than a pointer on whatever target architecture, a good compiler should optimize it away. Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? explains why a wider computation truncated to pointer width will give identical results as doing computation with pointer width in the first place. This is why compilers can optimize it away even on 32bit machines (or x86-64 with the x32 ABI) when 1ULL leads to promotion of the + operands to a 64bit type. (Or on some 64bit ABI for some architecture where long long is 128b).
1ULL looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in the return p[u + 1ULL];. All the other cases are compiled to a single load with scaled-index+4+p addressing mode. So other than one compiler's optimization failure, 1ULL looks fine for 32bit as well. (I think it's unlikely that it's a clang bug and that optimization is illegal).
int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
// ... load u from the stack
// add eax, 1
// mov eax, DWORD PTR p[0+eax*4]
instead of
mov eax, DWORD PTR p[4+eax*4]
Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32bit pointers and a register-call ABI similar to SySV AMD64). It uses a 32bit address-size prefix to avoid using the upper 32b of edi.
Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:
int x2 (char *&c) { return *c; }
// mov eax, DWORD PTR [edi] ; upper32 of rax is zero
// movsx eax, BYTE PTR [eax] ; could be byte [rax], saving one byte of machine code
Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.
Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.
The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.
If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.
int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
// mov eax, DWORD PTR [rdi]
// mov eax, DWORD PTR p[4+rax*4]

_ftol2_sse, are there faster options?

I have code which calls a lot of
int myNumber = (int)(floatNumber);
which takes up, in total, around 10% of my CPU time (according to profiler). While I could leave it at that, I wonder if there are faster options, so I tried searching around, and stumbled upon
http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/
http://stereopsis.com/FPU.html
I tried implementing the Real2Int() function given there, but it gives me wrong results, and runs slower. Now I wonder, are there faster implementations to floor double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found date back a bit, so it might just be outdated, and newer STL is faster at this.
The current implementation does:
013B1030 call _ftol2_sse (13B19A0h)
013B19A0 cmp dword ptr [___sse2_available (13B3378h)],0
013B19A7 je _ftol2 (13B19D6h)
013B19A9 push ebp
013B19AA mov ebp,esp
013B19AC sub esp,8
013B19AF and esp,0FFFFFFF8h
013B19B2 fstp qword ptr [esp]
013B19B5 cvttsd2si eax,mmword ptr [esp]
013B19BA leave
013B19BB ret
Related questions I found:
Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)
What is the fastest way to convert float to int on x86
Since both are old, or are ARM based, I wonder if there are current ways to do this. Note that it says the best conversion is one that doesn't happen, but I need to have it, so that will not be possible.

It's going to be hard to beat that if you are targeting generic x86 hardware. The runtime doesn't know for sure that the target machine has an SSE unit. If it did, it could do what the x64 compiler does and inline a cvttss2si opcode. But since the runtime has to check whether an SSE unit is available, you are left with the current implementation. That's what the implementation of ftol2_sse does. And what's more it passes the value in an x87 register and then transfers it to an SSE register if an SSE unit is available.
You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
That's not going to gain you all that much. It's just going to avoid all the overhead of ftol2_sse that happens before you actually reach the cvttss2si opcode that does the work.
To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.

For double I don't think you will be able to improve the results much but if you have a lot of floats to convert that using a packed conversion could help, the following is nasm code:
global _start
section .data
align 16
fv1: dd 1.1, 2.5, 2.51, 3.6
section .text
_start:
cvtps2dq xmm1, [fv1] ; Convert four 32-bit(single precision) floats to 32-bit(double word) integers and place the result in xmm1
There should be intrinsics code that allows you to do the same thing in an easier way but I am not as familiar with using intrinsics libraries. Although you are not using gcc this article Auto-vectorization with gcc 4.7 is an eye opener on how hard it can be to get the compiler to generate good vectorized code.

If you need speed and a large base of target machines, you'd better introduce a fast SSE version of all your algorithms, as well as a generic one -- and choose the algorithms to be executed at much higher level.
This would also mean that also the ABI is optimized for SSE; and that you can vectorize the calculation when available and that also the control logic is optimized for the architecture.
btw. even FLD; FIST sequence should take no longer than ~7 clock cycles on Pentium.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js