Cycles Per Element (CPE) vs. actual performance of polynomial evaluation - C++

In the book "Computer Systems: A Programmer's Perspective (3rd edition)"'s chapter 5, exercise 5.5 and 5.6 talked about Polynomial Evaluation:
It also gives two implementation poly() and polyh(), and says poly()'s CPE(Cycles Per Element) is 5.0 and polyh()'s CPE is 8.0, thus concludes poly() run faster than polyh(). **But with clang-12 or clang-14 on my ubuntu20.04, polyh() is much faster, instead of what these exercises said. I'm confused. **
The Polynomial Evaluation implementations:
// the naive method
double poly(double a[], double x, long degree)
{
    long i;
    double result = a[0];
    double xpwr = x;
    for (i = 1; i <= degree; i++)
    {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}

// Horner's method
double polyh(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree - 1; i >= 0; i--)
    {
        result = a[i] + x * result;
    }
    return result;
}
My compilation flags: -O1. Full implementation (including timer) is: https://godbolt.org/z/3eW8Wzr7z
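For reference, here is a minimal std::chrono timing harness in the same spirit (a sketch only, not the exact code behind the Godbolt link; the degree, x, and loop count are illustrative):

#include <chrono>
#include <cstdio>
#include <vector>

// poly() and polyh() as defined above

int main()
{
    const long degree = 100000;
    const double x = 1.00001;
    std::vector<double> a(degree + 1, 1.0);
    const int loops = 10;
    double sink = 0.0;                         // keep the result live so the calls aren't optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int l = 0; l < loops; ++l)
        sink += polyh(a.data(), x, degree);    // swap in poly() for the other measurement
    auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("polyh: took %.3f ms, loop=%d, avg = %.3f ms (sink=%g)\n", ms, loops, ms / loops, sink);
    return 0;
}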
My time cost result:
polyh: took 2.318 ms, loop=10, avg = 0.232 ms
poly: took 78.980 ms, loop=10, avg = 7.898 ms
Why does polyh() run faster despite its larger CPE?
Update: based on the comments from Passer By, I used the website quick-bench for time measurement, and with different array sizes the benchmark results differ:
n = 1000: poly() is faster (https://quick-bench.com/q/EpDmf22VD_E0CvLN0-6TY_Ye8bU)
n = 10000: polyh() is much faster (https://quick-bench.com/q/yuzoVzz_KhWv1gJ-_j9wlZtfWVM)

I think there is some confusion regarding the statements in the book. The link you have provided clearly shows polyh() to have a lower CPE than poly():
polyh(double*, double, long):
# skipping non-loop code...
mulsd xmm0, xmm1
addsd xmm0, qword ptr [rdi + 8*rsi - 16]
add rsi, -1
cmp rsi, 1
jg .LBB1_2
vs
poly(double*, double, long):
# skipping non-loop code...
movsd xmm3, qword ptr [rdi + 8*rax + 8]
mulsd xmm3, xmm2
addsd xmm0, xmm3
mulsd xmm2, xmm1
add rax, 1
cmp rsi, rax
jne .LBB0_2
Clearly polyh() is more compact code in comparison with poly().
Now let's talk about optimization. First of all, -O0 is used to disable optimization; -O1 enables only minimal optimizations.
But even if you throw optimization out of the window, the code in polyh() is already optimized before compilation: it has only one multiplication, one addition and one assignment per iteration, while poly() has two multiplications and two assignments.
Clearly polyh() is leaner and faster code.
UPDATE:
After the question was updated, here is what I found. I tested with the same quick-bench but used GCC instead of Clang (which I had been using on my computer), and the results are still the same: polyh() wins even with 1000 iterations.
https://quick-bench.com/q/_0IppR0fGBncrR60s5WtUiTq5U8

Related

How do I speed up a for loop with unrelated data?

A very simple example of passing an array of integers to a for loop is shown below. If those integers are unrelated to each other, how can I make a "for loop" iterate over all of them at the same time?
int waffles[3] = { 0 };
for (int i = 0; i < 3; i++) {
    waffles[i] = i;
}
What I get
clock 1: waffles[0] = 0;
clock 2: waffles[1] = 1;
clock 3: waffles[2] = 2;
What I want
clock 1: waffles[0] = 0, waffles[1] = 1, waffles[2] = 2
This can actually be done using SIMD instructions such as the AVX instructions, although it is not trivial to implement. You probably want to make 100% sure you are bottlenecked by a specific loop and really NEED to improve performance there.
This might help https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/
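For a feel of what that looks like, here is a minimal sketch using SSE2 intrinsics (a 128-bit register holds four ints, so it uses four elements instead of three; the names are just illustrative):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

int main()
{
    alignas(16) int waffles[4];
    __m128i values = _mm_set_epi32(3, 2, 1, 0);                     // one register holding {0, 1, 2, 3}
    _mm_store_si128(reinterpret_cast<__m128i*>(waffles), values);   // one 16-byte store writes all four
    printf("%d %d %d %d\n", waffles[0], waffles[1], waffles[2], waffles[3]);
    return 0;
}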
(I know this is not a full answer, but I can't comment yet and it might help anyway)
As François Andrieux's comment points out:
The compiler will very likely unroll that loop to the most efficient form for the targeted platform.
See how this code compiles in Godbolt's Compiler Explorer here.
Clang puts 0 and 1 using the same instruction:
movabs rax, 4294967296
mov qword ptr [rsp + 12], rax
mov dword ptr [rsp + 20], 2
gcc puts 1 and 2 using the same instruction:
mov DWORD PTR [rsp], 0
mov QWORD PTR [rsp+4], rax
A larger array would result in vectorized instructions that store even more data at once (see here).

Are bitwise operators faster?, if yes then why?

How much will it affect the performance if I use:
n>>1 instead of n/2
n&1 instead of n%2!=0
n<<3 instead of n*8
n++ instead of n+=1
and so on...
and if it does increase the performance then please explain why.
Any half decent compiler will optimize the two versions into the same thing. For example, GCC compiles this:
unsigned int half1(unsigned int n) { return n / 2; }
unsigned int half2(unsigned int n) { return n >> 1; }
bool parity1(int n) { return n % 2; }
bool parity2(int n) { return n & 1; }
int mult1(int n) { return n * 8; }
int mult2(int n) { return n << 3; }
void inc1(int& n) { n += 1; }
void inc2(int& n) { n++; }
to
half1(unsigned int):
mov eax, edi
shr eax
ret
half2(unsigned int):
mov eax, edi
shr eax
ret
parity1(int):
mov eax, edi
and eax, 1
ret
parity2(int):
mov eax, edi
and eax, 1
ret
mult1(int):
lea eax, [0+rdi*8]
ret
mult2(int):
lea eax, [0+rdi*8]
ret
inc1(int&):
add DWORD PTR [rdi], 1
ret
inc2(int&):
add DWORD PTR [rdi], 1
ret
One small caveat is that in the first example, if n could be negative (i.e. if it is signed and the compiler can't prove that it's nonnegative), then the division and the bit shift are not equivalent, and the division needs some extra instructions. Other than that, compilers are smart and will optimize operations with constant operands, so use whichever version makes more sense logically and is more readable.
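To illustrate the signed caveat, here is a small sketch (the function names are made up for illustration):

// C++ integer division truncates toward zero, while an arithmetic right shift rounds
// toward negative infinity, so for negative n the two differ: half_div(-3) == -1 but
// half_shift(-3) == -2. That is why the compiler emits extra fixup code for signed n / 2.
int half_div(int n)   { return n / 2; }
int half_shift(int n) { return n >> 1; }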
Strictly speaking, in most cases, yes.
This is because bit manipulation is a simpler operation for a CPU to perform: the circuitry in the ALU is much simpler and requires fewer discrete steps (clock cycles) to complete.
As others have mentioned, any compiler worth a damn will automatically detect arithmetic operations with constant operands that have bitwise analogs (like those in your examples) and will convert them to the appropriate bitwise operations under the hood.
Keep in mind that if the operands are runtime values rather than constants, such optimizations cannot occur.
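For example (a sketch with illustrative names): with a runtime divisor the compiler has to emit an actual division instruction, while a runtime shift count still compiles to a single shift, and of course the two no longer compute the same thing in general:

unsigned div_rt(unsigned n, unsigned m)   { return n / m; }   // div: typically tens of cycles
unsigned shift_rt(unsigned n, unsigned m) { return n >> m; }  // shr: ~1 cycle
// n / m equals n >> k only when m == 2^k, so with runtime operands the compiler
// cannot substitute one for the other.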

Optimizing the backward solve for a sparse lower triangular linear system

I have the compressed sparse column (CSC) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for x in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int* __restrict__ Lp,
               const int* __restrict__ Li,
               const double* __restrict__ Lx,
               const int n,
               double* __restrict__ x) {
  for (int i = n - 1; i >= 0; --i) {
    for (int j = Lp[i]; j < Lp[i + 1]; ++j) {
      x[i] -= Lx[j] * x[Li[j]];
    }
  }
}
Thus, b is passed in via the argument x and is overwritten by the solution. Lp, Li, and Lx are respectively the column pointers, row indices, and nonzero values of the standard CSC representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to vtune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question: What can I do to optimize this function?
This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline which benchmarks the implementation given in the question and benchBacksolveOptimized which benchmarks the proposed "optimized" implementations. There is also benchFillRhs which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
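For orientation, a self-contained sketch of that kind of setup is shown below (the synthetic band matrix and the fillRhs values here are stand-ins of mine, not the actual test matrix or the linked source file):

#include <benchmark/benchmark.h>
#include <algorithm>
#include <vector>

namespace {
constexpr int n = 3000;
std::vector<int> Lp{0};
std::vector<int> Li;
std::vector<double> Lx;

void buildMatrix() {  // strictly lower triangular band matrix, a few nonzeros per column
    for (int i = 0; i < n; ++i) {
        for (int r = i + 1; r < std::min(i + 8, n); ++r) {
            Li.push_back(r);
            Lx.push_back(0.01);
        }
        Lp.push_back(static_cast<int>(Li.size()));
    }
}

void fillRhs(double* x) {  // some not completely trivial right-hand side
    for (int i = 0; i < n; ++i) x[i] = 1.0 + 1e-3 * i;
}

void backsolve(const int* Lp, const int* Li, const double* Lx, int n, double* x) {
    for (int i = n - 1; i >= 0; --i)
        for (int j = Lp[i]; j < Lp[i + 1]; ++j)
            x[i] -= Lx[j] * x[Li[j]];
}
}  // namespace

static void benchFillRhs(benchmark::State& state) {
    std::vector<double> x(n);
    for (auto _ : state) {
        fillRhs(x.data());
        benchmark::DoNotOptimize(x.data());
    }
}
BENCHMARK(benchFillRhs);

static void benchBacksolveBaseline(benchmark::State& state) {
    if (Li.empty()) buildMatrix();
    std::vector<double> x(n);
    for (auto _ : state) {
        fillRhs(x.data());  // included in the timing; subtract benchFillRhs to isolate the solve
        backsolve(Lp.data(), Li.data(), Lx.data(), n, x.data());
        benchmark::DoNotOptimize(x.data());
    }
}
BENCHMARK(benchBacksolveBaseline);
BENCHMARK_MAIN();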
1. Iterating strictly backwards
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i = n - 1; i >= 0; --i) {
  for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
    x[i] -= Lx[j] * x[Li[j]];
  }
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measurable improvement:
------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2737 ns         2734 ns      5120000
benchBacksolveBaseline      17412 ns        17421 ns       829630
benchBacksolveOptimized     16046 ns        16040 ns       853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
2. Fewer loads/stores in the inner loop
As A is lower triangular, we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i = n - 1; i >= 0; --i) {
  double xi_temp = x[i];
  for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
    xi_temp -= Lx[j] * x[Li[j]];
  }
  x[i] = xi_temp;
}
This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2737 ns         2740 ns      5120000
benchBacksolveBaseline      17410 ns        17418 ns       814545
benchBacksolveOptimized     15155 ns        15147 ns       887129
3. Unroll the loop
While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still does not. So let's give that a try by additionally passing -funroll-loops.
------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2733 ns         2734 ns      5120000
benchBacksolveBaseline      15079 ns        15081 ns       953191
benchBacksolveOptimized     14392 ns        14385 ns       963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:
for (int i = n - 1; i >= 0; --i) {
  const int col_begin = Lp[i];
  const int col_end = Lp[i + 1];
  const bool is_col_nnz_odd = (col_end - col_begin) & 1;
  double xi_temp = x[i];
  int j = col_end - 1;
  if (is_col_nnz_odd) {
    xi_temp -= Lx[j] * x[Li[j]];
    --j;
  }
  for (; j >= col_begin; j -= 2) {
    xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
               Lx[j - 1] * x[Li[j - 1]];
  }
  x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                 2728 ns         2729 ns      5090909
benchBacksolveBaseline      17451 ns        17449 ns       822018
benchBacksolveOptimized     13440 ns        13443 ns      1018182
Other algorithms
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods that operate on dense submatrices assembled from the sparse structure; examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think such methods also apply this idea to the layout and use dense matrix operations for the lower triangular backward solves (for example for Cholesky-type factorizations). So it might be worth looking into those kinds of methods if you are not forced to stick to the simple approach that works on the sparse structure directly. See for example this survey by Davis.
You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:
void backsolve(const unsigned* __restrict__ Lp,
               const unsigned* __restrict__ Li,
               const double* __restrict__ Lx,
               const unsigned n,
               double* __restrict__ x) {
  for (unsigned i = n; i-- > 0; ) {
    for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
      x[i] -= Lx[j] * x[Li[j]];
    }
  }
}
Compiling with Godbolt's compiler explorer shows slightly different code for the inner loop, potentially making better use of the CPU pipeline. I cannot test it, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8

Is there a branchless method to quickly find the min/max of two double-precision floating-point values?

I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.
Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.
Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:
#include <algorithm>

double FindMinimum(double a, double b)
{
    return std::min(a, b);
}

double FindMaximum(double a, double b)
{
    return std::max(a, b);
}
I bet you've seen this before. In case you don't believe that this is branchless, check out the disassembly:
FindMinimum(double, double):
minsd xmm1, xmm0
movapd xmm0, xmm1
ret
FindMaximum(double, double):
maxsd xmm1, xmm0
movapd xmm0, xmm1
ret
That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most 32-bit x86 processors support SSE2, since it was released back in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what if you did? Well, even there, you get branchless code on most popular compilers:
FindMinimum(double, double):
fld QWORD PTR [esp + 12]
fld QWORD PTR [esp + 4]
fucomi st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
FindMaximum(double, double):
fld QWORD PTR [esp + 4]
fld QWORD PTR [esp + 12]
fucomi st(1)
fxch st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.
The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:
double FindMinimum(double, double) PROC
fld QWORD PTR [a]
fld QWORD PTR [b]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(1) ; return "b"
ret
Alt:
fstp st(0) ; return "a"
ret
double FindMinimum(double, double) ENDP
double FindMaximum(double, double) PROC
fld QWORD PTR [b]
fld QWORD PTR [a]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(0) ; return "b"
ret
Alt:
fstp st(1) ; return "a"
ret
double FindMaximum(double, double) ENDP
Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bit, and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word will also be relatively slow. This is why the Pentium Pro introduced the FCOMI instructions.
However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:
The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:
Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.
If you wanted to pursue this strategy anyway—say, to benchmark it—you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. So, since you are assuming that both values are positive:
FindMinimumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [a]
mov rdx, QWORD PTR [b]
sub rax, rdx ; subtract bitwise representation of the two values
shr rax, 63 ; isolate the sign bit to see if the result was negative
ret
FindMaximumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [b] ; \ reverse order of parameters
mov rdx, QWORD PTR [a] ; / for the SUB operation
sub rax, rdx
shr rax, 63
ret
Or, to avoid inline assembly:
#include <climits>   // CHAR_BIT
#include <cstdint>
#include <cstring>   // std::memcpy

bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    uint64_t aBits;
    uint64_t bBits;
    std::memcpy(&aBits, &a, sizeof(aBits));   // well-defined type punning (reinterpret_cast would violate strict aliasing)
    std::memcpy(&bBits, &b, sizeof(bBits));
    return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}

bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    uint64_t aBits;
    uint64_t bBits;
    std::memcpy(&aBits, &a, sizeof(aBits));
    std::memcpy(&bBits, &b, sizeof(bBits));
    return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.
// ...
// Enforce two's-complement lexicographic ordering.
// (For these checks to do anything, aBits and bBits must be read into signed
// integers, e.g. int64_t, instead of uint64_t.)
if (aBits < 0)
{
    aBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - aBits);
}
if (bBits < 0)
{
    bBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - bBits);
}
// ...
Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.

Strange uint32_t to float array conversion

I have the following code snippet:
#include <cstdio>
#include <cstdint>
static const size_t ARR_SIZE = 129;
int main()
{
uint32_t value = 2570980487;
uint32_t arr[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
arr[x] = value;
float arr_dst[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
{
arr_dst[x] = static_cast<float>(arr[x]);
}
printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
return 0;
}
If I compile it under MS Visual Studio 2015 I can see that the output is:
WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000
So the last arr_dst element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array!
Is it a bug?
I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:
for (int x = 0; x < ARR_SIZE; ++x)
{
    if (x == 0)
        x = 0;
    arr_dst[x] = static_cast<float>(arr[x]);
}
So this probably is some issue with vectorizing optimisation.
This behavior does not reproduce on gcc 4.8. Any ideas?
A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t.
At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.
If the program performs the == comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if indeed they compared unequal, then it is likely that the results of those conversions differ, too.
I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.
Converting between unsigned and float is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)
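In scalar C++ the signed-convert-plus-fixup idea looks roughly like this (a sketch of the concept, not MSVC's actual code; 4294967296.0f is the 0x4f800000 constant that shows up in the vectorized loop below):

#include <cstdint>

float u32_to_float(uint32_t u)
{
    // Convert as signed first; for inputs with the top bit set this is off by exactly 2^32.
    float f = static_cast<float>(static_cast<int32_t>(u));
    if (static_cast<int32_t>(u) < 0)
        f += 4294967296.0f;   // add 2^32 back
    return f;
}
// Note this rounds twice (once in the convert, once in the add), which is exactly the kind
// of double rounding that can leave the result slightly more than 0.5 ulp off.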
MSVC vectorizes the first 128 elements with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double and then from double back to float.
gcc and clang produce the 2570980608.0 result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.
IEEE math demands that + - * / and sqrt produce correctly rounded results (less than 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).
See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.
We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):
; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040 ; size = 516
_arr$ = -520 ; size = 516
_main PROC ; COMDAT
00013 mov edx, 129
00018 mov eax, -1723986809 ; this is your unsigned 2570980487
0001d mov ecx, edx
00023 lea edi, DWORD PTR _arr$[esp+1088] ; edi=arr
0002a rep stosd ; memset in chunks of 4B
# arr[0..128] = 2570980487 at this point
0002c xor ecx, ecx ; i = 0
# xmm2 = 0.0 in each element (i.e. all-zero)
# xmm3 = __xmm#4f8000004f8000004f8000004f800000 (a constant repeated in each of 4 float elements)
####### The vectorized unsigned->float conversion strategy:
$LL7#main: ; do{
00030 movups xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088] ; load 4 uint32_t
00038 cvtdq2ps xmm1, xmm0 ; SIGNED int to Single-precision float
0003b movaps xmm0, xmm1
0003e cmpltps xmm0, xmm2 ; xmm0 = (xmm0 < 0.0)
00042 andps xmm0, xmm3 ; mask the magic constant
00045 addps xmm0, xmm1 ; x += (x<0.0) ? magic_constant : 0.0f;
# There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
00048 movups XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
; and repeat the same thing again, with addresses that are 16B higher (+1104)
; i.e. this loop is unrolled by two
0006a add ecx, 8 ; i+=8 (two vectors of 4 elements)
0006d cmp ecx, 128
00073 jb SHORT $LL7#main ; }while(i<128)
#### End of vectorized loop
# and then IDK what MSVC is smoking; both these values are known at compile time. Is /Ogtp not full optimization?
# I don't see a branch target that would let execution reach this code
# other than by falling out of the loop that ends with ecx=128
00075 cmp ecx, edx
00077 jae $LN21#main ; if(i>=129): always false
0007d sub edx, ecx ; edx = 129-128 = 1
... some more ridiculous known-at-compile-time jumping later ...
######## The scalar unsigned->float conversion strategy for the last element
$LC15#main:
00140 mov eax, DWORD PTR _arr$[esp+ecx*4+1088]
00147 movd xmm0, eax
# eax = xmm0[0] = arr[128]
0014b cvtdq2pd xmm0, xmm0 ; convert the last element TO DOUBLE
0014f shr eax, 31 ; shift the sign bit to bit 1, so eax = 0 or 1
; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
00152 addsd xmm0, QWORD PTR __xmm#41f00000000000000000000000000000[eax*8]
0015b cvtpd2ps xmm0, xmm0 ; double -> float
0015f movss DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; and store it
00165 inc ecx ; ++i;
00166 cmp ecx, 129 ; } while(i<129)
0016c jb SHORT $LC15#main
# Yes, this is a loop, which always runs exactly once for the last element
By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)
See the code on the Godbolt compiler explorer.
gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.
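In plain C++ that idea looks something like this (an illustrative sketch, not gcc's literal output):

#include <cstdint>

float u32_to_float_halves(uint32_t u)
{
    float hi = static_cast<float>(u >> 16);       // upper 16 bits, exactly representable in float
    float lo = static_cast<float>(u & 0xFFFFu);   // lower 16 bits, exactly representable in float
    return hi * 65536.0f + lo;                    // hi * 65536 is exact, so only the final add rounds
}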
Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissas of two floats directly (with some tricks to set the exponents: bitwise boolean stuff and an ADDPS), then adds the low and high halves together like gcc does.
Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.
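For example (a sketch):

#include <cstdint>

float u32_to_float_x64(uint32_t u)
{
    // int64_t holds every uint32_t value exactly, so a single signed 64-bit convert
    // (one cvtsi2ss with a 64-bit source register) is enough.
    return static_cast<float>(static_cast<int64_t>(u));
}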
I did an investigation on a PowerPC implementation (Freescale MPC7450), as they are IMHO far better documented than any voodoo Intel comes up with.
As it turns out, the floating-point unit (FPU) and the vector unit may use different rounding for floating-point operations. The FPU can be configured to use one of four rounding modes: round to nearest (default), truncate, toward positive infinity, and toward negative infinity. The vector unit, however, is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106 bits. The vector unit fulfills IEEE-754, but the documentation does not state much more.
Looking at your result, the converted value 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit, or uses a different rounding mode.