C++ Optimizer: division

Let's say I have two if statements:
if (frequency1_mhz > frequency2_hz * 1000) {// some code}
if (frequency1_mhz / 1000 > frequency2_hz ) {// some code}
I'd imagine the two to behave exactly the same, yet I'm guessing the first statement, with the multiplication, is more efficient than the one with the division.
Would a C++ compiler optimize this? Or is this something I should take into account when designing my code?

Yes and no.
The code is not identical:
due to rounding, there can be differences in results (e.g. frequency1_mhz=1001 and frequency2_hz=1)
the first version might overflow sooner than the second one. e.g. a frequency2_hz of 1000000000 would overflow an int (and cause UB)
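For instance, a minimal sketch of the first point: with frequency1_mhz = 1001 and frequency2_hz = 1 the two conditions disagree, because the integer division discards the remainder:
#include <iostream>

int main() {
    int frequency1_mhz = 1001, frequency2_hz = 1;
    std::cout << (frequency1_mhz > frequency2_hz * 1000) << '\n'; // 1: 1001 > 1000
    std::cout << (frequency1_mhz / 1000 > frequency2_hz) << '\n'; // 0: 1 > 1 is false
}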
It's still possible for the compiler to perform the division using multiplication (see the sketch after the listing below).
When unsure, just look at the generated assembly.
Here's the generated assembly for both versions. The second one is longer, but still contains no division.
version1(int, int):
imul esi, esi, 1000
xor eax, eax
cmp esi, edi
setl al
ret
version2(int, int):
movsx rax, edi
imul rax, rax, 274877907 ; look ma, no idiv!
sar edi, 31
sar rax, 38
sub eax, edi
cmp eax, esi
setg al
movzx eax, al
ret
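As a side note on how that imul/sar pair works: 274877907 is 2^38 / 1000 rounded up, so for non-negative 32-bit x the quotient x / 1000 can be computed as a multiply followed by an arithmetic shift (the extra sar/sub in the listing handles rounding toward zero for negative inputs). A minimal sketch of the idea, not the compiler's exact code:
#include <cassert>
#include <cstdint>

// 274877907 == ceil(2^38 / 1000), so for x >= 0: x / 1000 == (x * 274877907) >> 38.
int32_t div1000(int32_t x) {
    return static_cast<int32_t>((static_cast<int64_t>(x) * 274877907) >> 38);
}

int main() {
    for (int32_t x : {0, 1, 999, 1000, 123456789, 2147483647})
        assert(div1000(x) == x / 1000);
}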

No, these are not equivalent statements, because division is not a precise inverse of multiplication for either floats or integers.
Integer division rounds positive fractions down:
constexpr int f1=999;
constexpr int f2=0;
static_assert(f1>f2*1000);
static_assert(f1/1000==f2);
Reciprocals are not precise, so multiplying by a reciprocal is not the same as dividing:
static_assert(3.0/10 != 3*(1.0/10));

If the operands are floats and the code is built with -O3, GCC generates essentially the same assembly for both (for better or for worse): the only difference is that the first version multiplies while the second keeps the division, since replacing the division by a multiplication with 0.001 would change the result.
bool first(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz > frequency2_hz * 1000;
}
bool second(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz / 1000 > frequency2_hz;
}
The assembly:
first(float, float):
mulss xmm1, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
second(float, float):
divss xmm0, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
.LC0:
.long 1148846080
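; 1148846080 is the IEEE-754 bit pattern of the float constant 1000.0f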
So, really, it ends up being essentially the same code, just with divss in place of mulss :-)

Related

Compiler optimization for sum of squared numbers [duplicate]

This question already has answers here:
How does clang generate non-looping code for sum of squares?
Here is something that I find interesting:
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
This is the naive implementation of the sum of squared numbers in Rust. This is the assembly code generated by rustc 1.65.0 with -O3:
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, 2
jl .LBB0_2
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
.LBB0_2:
ret
I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.
Then I wanted to see what clang and gcc do in C++ for the same function
test edi, edi
jle .L8
lea eax, [rdi-1]
cmp eax, 17
jbe .L9
mov edx, edi
movdqa xmm3, XMMWORD PTR .LC0[rip]
xor eax, eax
pxor xmm1, xmm1
movdqa xmm4, XMMWORD PTR .LC1[rip]
shr edx, 2
.L4:
movdqa xmm0, xmm3
add eax, 1
paddd xmm3, xmm4
movdqa xmm2, xmm0
pmuludq xmm2, xmm0
psrlq xmm0, 32
pmuludq xmm0, xmm0
pshufd xmm2, xmm2, 8
pshufd xmm0, xmm0, 8
punpckldq xmm2, xmm0
paddd xmm1, xmm2
cmp eax, edx
jne .L4
movdqa xmm0, xmm1
mov eax, edi
psrldq xmm0, 8
and eax, -4
paddd xmm1, xmm0
add eax, 1
movdqa xmm0, xmm1
psrldq xmm0, 4
paddd xmm1, xmm0
movd edx, xmm1
test dil, 3
je .L1
.L7:
mov ecx, eax
imul ecx, eax
add eax, 1
add edx, ecx
cmp edi, eax
jge .L7
.L1:
mov eax, edx
ret
.L8:
xor edx, edx
mov eax, edx
ret
.L9:
mov eax, 1
xor edx, edx
jmp .L7
.LC0:
.long 1
.long 2
.long 3
.long 4
.LC1:
.long 4
.long 4
.long 4
.long 4
This is gcc 12.2 with -O3. GCC also does not use the sum-of-squares formula. I also don't know why it checks whether the number is greater than 17. For some reason, gcc emits many more operations than clang and rustc.
This is clang 15.0.0 with -O3
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
ret
.LBB0_1:
xor eax, eax
ret
I don't really understand what kind of optimization clang is doing there. But rustc, clang, and gcc don't use n(n+1)(2n+1)/6.
Then I timed their performance. Rust is doing significantly better than gcc and clang. These are the average results of 100 executions, using an 11th gen Intel Core i7-11800H @ 2.30 GHz:
Rust: 0.2 microseconds
Clang: 3 microseconds
gcc: 5 microseconds
Can someone explain the performance difference?
Edit
C++:
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
EDIT 2
For everyone wondering, here is my benchmark code:
use std::time::Instant;
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
fn main() {
let start = Instant::now();
let result = sum_of_squares(1000);
let elapsed = start.elapsed();
println!("Result: {}", result);
println!("Elapsed time: {:?}", elapsed);
}
And in C++:
#include <chrono>
#include <iostream>
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
int main() {
auto start = std::chrono::high_resolution_clock::now();
int result = sum_of_squares(1000);
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Result: " << result << std::endl;
std::cout << "Elapsed time: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
<< " microseconds" << std::endl;
return 0;
}
I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.
LLVM does transform that loop into a closed-form expression, but it is different from the naive sum-of-squares formula.
This article explains the formula and the generated code better than I could.
Clang does the same optimization with -O3 in C++, but GCC does not yet. See on Godbolt. AFAIK, the default Rust compiler uses LLVM internally, like Clang; this is why they produce similar code. GCC uses a naive loop vectorized with SIMD instructions, while Clang uses a formula like the one you gave in the question.
The optimized assembly code from the C++ code is the following:
sum_of_squares(int): # #sum_of_squares(int)
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
ret
.LBB0_1:
xor eax, eax
ret
This optimization mainly comes from the IndVarSimplify optimization pass. One can see that some variables are encoded on 32 bits while others are encoded on 33 bits (requiring a 64-bit register on mainstream platforms). The code basically does:
if(edi == 0)
return 0;
eax = rdi - 1;
ecx = rdi - 2;
rcx *= rax;
eax = rdi - 3;
rax *= rcx;
rax >>= 1;
eax *= 1431655766;
rcx >>= 1;
ecx = rcx + 4*rcx;
ecx += eax;
eax = rcx + 4*rdi;
eax -= 3;
return eax;
This can be further simplified to the following equivalent C++ code:
if(n == 0)
return 0;
int64_t m = n;
int64_t tmp = ((m - 3) * (m - 1) * (m - 2)) / 2;
tmp = int32_t(int32_t(tmp) * int32_t(1431655766));
return 5 * ((m - 1) * (m - 2) / 2) + tmp + (4*m - 3);
Note that some casts and overflows are ignored for sake of clarity.
The magical number 1431655766 comes from a kind of correction for an overflow related to a division by 3. Indeed, 1431655766 / 2**32 ~= 0.33333333348855376. Clang plays with the 32-bit overflows so as to generate a fast implementation of the formula n(n+1)(2n+1)/6.
Division by a constant c on a machine with a 128-bit product is often implemented by multiplying by 2^64 / c. That's where your strange constant comes from.
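A hedged illustration of that multiply-by-a-scaled-reciprocal idea, at 32-bit scale rather than 2^64: the classic magic constant for unsigned 32-bit division by 3 is 0xAAAAAAAB = ceil(2^33 / 3) (a relative of the 1431655766 = 0x55555556 in the generated code, which also encodes 1/3), used with a widening multiply and a shift:
#include <cassert>
#include <cstdint>

// floor(x / 3) == (x * ceil(2^33 / 3)) >> 33 for every 32-bit unsigned x.
uint32_t div3(uint32_t x) {
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xAAAAAAABu) >> 33);
}

int main() {
    for (uint32_t x : {0u, 1u, 2u, 3u, 100u, 4294967295u})
        assert(div3(x) == x / 3);
}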
Now the intermediate product n(n+1)(2n+1) will overflow for large n even when the final sum still fits, so this formula can only be used very, very carefully.
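For completeness, a minimal sketch (not the compiler's code) of the closed form evaluated in 64-bit arithmetic, so that the intermediate product cannot overflow for any n whose true sum fits in a 32-bit int:
#include <cstdint>

int32_t sum_of_squares_closed_form(int32_t n) {
    if (n <= 0) return 0;
    int64_t m = n; // widen before multiplying so n*(n+1)*(2n+1) cannot overflow
    return static_cast<int32_t>(m * (m + 1) * (2 * m + 1) / 6);
}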

Finding max number between two, which implementation to choose

I am trying to figure out which implementation has an edge over the other when finding the max of two numbers. As an example, let's examine two implementations:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-8]
.L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produces more assembly instructions, but the first one has a conditional jump. I'm just trying to understand whether both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
int findMax3(int a,int b){
return std::max(a,b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in assembly identical to std::max, while your second variant does more work. Actually, when trying to optimize, you need to specify what you are optimizing for. There are several options that typically require a trade-off to be made: runtime, memory usage, size of the executable, readability of the code, etc. Typically you cannot get it all at once.
When in doubt, do not reinvent the wheel; use the existing, already optimized std::max. And do not forget that the code you write is not instructions for your CPU; rather, it is a high-level abstract description of what the program should do. It's the compiler's job to figure out how that can best be achieved.
Last but not least, your second variant is actually broken. See the example here, compiled with -O2 -fsanitize=signed-integer-overflow, which results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list: code that is difficult to read and understand is also difficult to prove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isn't your fault ;).
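A hedged illustration of that sanitizer report: with a = INT_MIN and b = INT_MAX the subtraction overflows; under typical two's-complement wraparound the difference becomes 1, the sign test then picks the wrong operand, and the function returns the minimum instead of the maximum (the behaviour is undefined either way):
#include <climits>
#include <iostream>

int findMax2(int a, int b) { // same code as in the question
    int diff = a - b;        // overflows (UB) for a = INT_MIN, b = INT_MAX
    int s = (diff >> 31) & 1;
    return a - (s * diff);
}

int main() {
    std::cout << findMax2(INT_MIN, INT_MAX) << '\n'; // typically prints -2147483648
}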
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately, although I conjecture that a compare and jump is probably the best way on the whole, even at the expense of a pipeline flush.
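For the curious, a small self-contained sketch of that expression, checked against std::max (illustration only; as said above, std::max is what you should actually write):
#include <algorithm>
#include <cassert>

// -(a < b) is all-one bits when a < b and all-zero bits otherwise,
// so the expression selects b in the first case and a in the second.
int branchless_max(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

int main() {
    for (int a : {-5, 0, 3, 42})
        for (int b : {-7, 0, 3, 100})
            assert(branchless_max(a, b) == std::max(a, b));
}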
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on a RISC-style GPU, like those from NVIDIA or AMD.
The term "performs well" is only significant in a tight loop.

Bit trick to detect if any of some integers has a specific value

Is there any clever bit trick to detect if any of a small number of integers (say 3 or 4) has a specific value?
The straightforward
bool test(int a, int b, int c, int d)
{
// The compiler will pretty likely optimize it to (a == d | b == d | c == d)
return (a == d || b == d || c == d);
}
in GCC compiles to
test(int, int, int, int):
cmp ecx, esi
sete al
cmp ecx, edx
sete dl
or eax, edx
cmp edi, ecx
sete dl
or eax, edx
ret
Those sete instructions have higher latency than I want to tolerate, so I would rather use some bitwise operations (&, |, ^, ~) and a single comparison.
The only solution I've found so far is:
int s1 = ((a-d) >> 31) | ((d-a) >> 31);
int s2 = ((b-d) >> 31) | ((d-b) >> 31);
int s3 = ((c-d) >> 31) | ((d-c) >> 31);
int s = s1 & s2 & s3;
return (s & 1) == 0;
alternative variant:
int s1 = (a-d) | (d-a);
int s2 = (b-d) | (d-b);
int s3 = (c-d) | (d-c);
int s = (s1 & s2 & s3);
return (s & 0x80000000) == 0;
both are translated to:
mov eax, ecx
sub eax, edi
sub edi, ecx
or edi, eax
mov eax, ecx
sub eax, esi
sub esi, ecx
or esi, eax
and esi, edi
mov eax, edx
sub eax, ecx
sub ecx, edx
or ecx, eax
test esi, ecx
setns al
ret
which has fewer sete instructions, but obviously more mov/sub.
Update: as @BeeOnRope suggested, it makes sense to cast the input variables to unsigned.
It is not a full bit trick: any zero factor yields a zero product, which gives a zero result, and negating 0 (with !) yields 1. It does not deal with overflow.
bool test(int a, int b, int c, int d)
{
return !((a^d)*(b^d)*(c^d));
}
gcc 7.1 -O3 output. (d is in ecx, the other inputs start in other integer regs).
xor edi, ecx
xor esi, ecx
xor edx, ecx
imul edi, esi
imul edx, edi
test edx, edx
sete al
ret
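A hedged example of the overflow caveat noted above ("does not deal with overflow"): none of a, b, c equals d here, yet the product is 2^32, which is signed-overflow UB and in practice wraps to 0, so the function reports a match:
#include <iostream>

bool test(int a, int b, int c, int d) { // same function as above
    return !((a ^ d) * (b ^ d) * (c ^ d));
}

int main() {
    // (a^d)*(b^d)*(c^d) = 0x10000 * 0x10000 * 1 = 2^32, which wraps to 0.
    std::cout << test(0x10000, 0x10000, 1, 0) << '\n'; // typically prints 1: a false positive
}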
It might be faster than the original on Core2 or Nehalem where partial-register stalls are a problem. imul r32,r32 has 3c latency on Core2/Nehalem (and later Intel CPUs), and 1 per clock throughput, so this sequence has 7 cycle latency from the inputs to the 2nd imul result, and another 2 cycles of latency for test/sete. Throughput should be fairly good if this sequence runs on multiple independent inputs.
Using a 64-bit multiply would avoid the overflow problem on the first multiply, but the second could still overflow if the total is >= 2**64. It would still be the same performance on Intel Nehalem and Sandybridge-family, and AMD Ryzen. But it would be slower on older CPUs.
In x86 asm, doing the second multiply with a full-multiply one-operand mul instruction (64x64b => 128b) would avoid overflow, and the result could be checked for being all-zero or not with or rax,rdx. We can write that in GNU C for 64-bit targets (where __int128 is available)
bool test_mulwide(unsigned a, unsigned b, unsigned c, unsigned d)
{
unsigned __int128 mul1 = (a^d)*(unsigned long long)(b^d);
return !(mul1*(c^d));
}
and gcc/clang really do emit the asm we hoped for (each with some useless mov instructions):
# gcc -O3 for x86-64 SysV ABI
mov eax, esi
xor edi, ecx
xor eax, ecx
xor ecx, edx # zero-extends
imul rax, rdi
mul rcx # 64 bit inputs (rax implicit), 128b output in rdx:rax
mov rsi, rax # this is useless
or rsi, rdx
sete al
ret
This should be almost as fast as the simple version that can overflow, on modern x86-64 (mul r64 is still only 3c latency, but 2 uops instead of 1 for the imul r64,r64 that doesn't produce the high half, on Intel Sandybridge-family).
It's still probably worse than clang's setcc/or output from the original version, which uses 8-bit or instructions to avoid reading 32-bit registers after writing the low byte (i.e. no partial-register stalls).
See both sources with both compilers on the Godbolt compiler explorer. (Also included: #BeeOnRope's ^ / & version that risks false positives, with and without a fallback to a full check.)

Successfully enabling -fno-finite-math-only on NaN removal method

In finding a bug which made everything turn into NaNs when running the optimized version of my code (compiling in g++ 4.8.2 and 4.9.3), I identified that the problem was the -Ofast option, specifically, the -ffinite-math-only flag it includes.
One part of the code involves reading floats from a FILE* using fscanf, and then replacing all NaNs with a numeric value. As could be expected, however, -ffinite-math-only kicks in, and removes these checks, thus leaving the NaNs.
In trying to solve this problem, I stumbled upon this, which suggested adding -fno-finite-math-only as a method attribute to disable the optimization on the specific method. The following illustrates the problem and the attempted fix (which doesn't actually fix it):
#include <cstdio>
#include <cmath>
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
int main(void){
const size_t cnt = 10;
float val[cnt];
for(int i = 0; i < cnt; i++) scanf("%f", val + i);
replaceNaN(val, cnt, -1.0f);
for(int i = 0; i < cnt; i++) printf("%f ", val[i]);
return 0;
}
The code does not act as desired if compiled/run using echo 1 2 3 4 5 6 7 8 nan 10 | (g++ -ffinite-math-only test.cpp -o test && ./test); specifically, it outputs a nan (which should have been replaced by a -1.0f) -- it behaves fine if the -ffinite-math-only flag is omitted. Shouldn't this work? Am I missing something with the syntax for attributes in gcc, or is this one of the aforementioned cases of "there being some trouble with some version of GCC related to this" (from the linked SO question)?
A few solutions I'm aware of, but I would rather have something a bit cleaner/more portable:
Compile the code with -fno-finite-math-only (my interim solution): I suspect that this optimization may be rather useful in my context in the remainder of the program;
Manually look for the string "nan" in the input stream, and then replace the value there (the input reader is in an unrelated part of the library, so it would be poor design to include this test there).
Assume a specific floating point architecture and make my own isNaN: I may do this, but it's a bit hackish and non-portable.
Prefilter the data using a separately compiled program without the -ffinite-math-only flag, and then feed that into the main program: The added complexity of maintaining two binaries and getting them to talk to each other just isn't worth it.
Edit: As put in the accepted answer, it would seem this is a compiler "bug" in older versions of g++, such as 4.8.2 and 4.9.3, that is fixed in newer versions, such as 5.1 and 6.1.1.
If for some reason updating the compiler is not a reasonably easy option (e.g.: no root access), or adding this attribute to a single function still doesn't completely solve the NaN check problem, an alternate solution, if you can be certain that the code will always run in an IEEE754 floating point environment, is to manually check the bits of the float for a NaN signature.
The accepted answer suggests doing this using a bit field; however, the order in which the compiler places the elements of a bit field is implementation-defined, and in fact changes between the older and newer versions of g++, which even refuse to adhere to the desired positioning in the older versions (4.8.2 and 4.9.3 always place the mantissa first), regardless of the order in which the members appear in the code.
A solution using bit manipulation, however, is guaranteed to work on all IEEE754-compliant compilers. Below is the implementation I ultimately used to solve my problem. It checks for IEEE754 compliance, and I've extended it to allow for doubles, as well as other more routine floating-point bit manipulations.
#include <limits> // IEEE754 compliance test
#include <type_traits> // enable_if
#include <cstdint> // uint32_t, uint64_t
template<
typename T,
typename = typename std::enable_if<std::is_floating_point<T>::value>::type,
typename = typename std::enable_if<std::numeric_limits<T>::is_iec559>::type,
typename u_t = typename std::conditional<std::is_same<T, float>::value, uint32_t, uint64_t>::type
>
struct IEEE754 {
enum class WIDTH : size_t {
SIGN = 1,
EXPONENT = std::is_same<T, float>::value ? 8 : 11,
MANTISSA = std::is_same<T, float>::value ? 23 : 52
};
enum class MASK : u_t {
SIGN = (u_t)1 << (sizeof(u_t) * 8 - 1),
EXPONENT = ((~(u_t)0) << (size_t)WIDTH::MANTISSA) ^ (u_t)MASK::SIGN,
MANTISSA = (~(u_t)0) >> ((size_t)WIDTH::SIGN + (size_t)WIDTH::EXPONENT)
};
union {
T f;
u_t u;
};
IEEE754(T f) : f(f) {}
inline u_t sign() const { return (u & (u_t)MASK::SIGN) >> ((size_t)WIDTH::EXPONENT + (size_t)WIDTH::MANTISSA); } // extract the sign bit
inline u_t exponent() const { return (u & (u_t)MASK::EXPONENT) >> (size_t)WIDTH::MANTISSA; } // extract the exponent field
inline u_t mantissa() const { return u & (u_t)MASK::MANTISSA; }
inline bool isNan() const {
return (mantissa() != 0) && ((u & ((u_t)MASK::EXPONENT)) == (u_t)MASK::EXPONENT);
}
};
template<typename T>
inline IEEE754<T> toIEEE754(T val) { return IEEE754<T>(val); }
And the replaceNaN function now becomes:
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++)
if (toIEEE754(arr[i]).isNan()) arr[i] = newValue;
}
An inspection of the assembly of these functions reveals that, as expected, all masks become compile-time constants, leading to the following (seemingly) efficient code:
# In loop of replaceNaN
movl (%rcx), %eax # eax = arr[i]
testl $8388607, %eax # Check if mantissa is empty
je .L3 # If it is, it's not a nan (it's inf), continue loop
andl $2139095040, %eax # Mask leaves only exponent
cmpl $2139095040, %eax # Test if exponent is all 1s
jne .L3 # If it isn't, it's not a nan, so continue loop
This is one instruction fewer than with a working bit-field solution (no shift), and the same number of registers are used (although it's tempting to say this alone makes it more efficient, there are other concerns, such as pipelining, which may make one solution more or less efficient than the other).
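For reference, a quick usage sketch of the IEEE754 helper above (hypothetical values, just to show the intended call pattern):
#include <cstdio>
#include <limits>

int main() {
    float nan_value = std::numeric_limits<float>::quiet_NaN();
    std::printf("%d\n", (int)toIEEE754(nan_value).isNan()); // 1
    std::printf("%d\n", (int)toIEEE754(1.5f).isNan());      // 0
    std::printf("%d\n", (int)toIEEE754(0.0).isNan());       // 0 (double overload)
}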
Looks like a compiler bug to me. Up through GCC 4.9.2, the attribute is completely ignored. GCC 5.1 and later pay attention to it. Perhaps it's time to upgrade your compiler?
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only on GCC 4.9.2:
replaceNaN(float*, int, float):
rep ret
But with the exact same settings on GCC 5.1:
replaceNaN(float*, int, float):
test esi, esi
jle .L26
sub rsp, 8
call std::isnan(float) [clone .isra.0]
test al, al
je .L2
mov rax, rdi
and eax, 15
shr rax, 2
neg rax
and eax, 3
cmp eax, esi
cmova eax, esi
cmp esi, 6
jg .L28
mov eax, esi
.L5:
cmp eax, 1
movss DWORD PTR [rdi], xmm0
je .L16
cmp eax, 2
movss DWORD PTR [rdi+4], xmm0
je .L17
cmp eax, 3
movss DWORD PTR [rdi+8], xmm0
je .L18
cmp eax, 4
movss DWORD PTR [rdi+12], xmm0
je .L19
cmp eax, 5
movss DWORD PTR [rdi+16], xmm0
je .L20
movss DWORD PTR [rdi+20], xmm0
mov edx, 6
.L7:
cmp esi, eax
je .L2
.L6:
mov r9d, esi
lea r8d, [rsi-1]
mov r11d, eax
sub r9d, eax
lea ecx, [r9-4]
sub r8d, eax
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea r10d, [0+rcx*4]
jbe .L9
movaps xmm1, xmm0
lea r8, [rdi+r11*4]
xor eax, eax
shufps xmm1, xmm1, 0
.L11:
add eax, 1
add r8, 16
movaps XMMWORD PTR [r8-16], xmm1
cmp ecx, eax
ja .L11
add edx, r10d
cmp r9d, r10d
je .L2
.L9:
movsx rax, edx
movss DWORD PTR [rdi+rax*4], xmm0
lea eax, [rdx+1]
cmp eax, esi
jge .L2
add edx, 2
cdqe
cmp esi, edx
movss DWORD PTR [rdi+rax*4], xmm0
jle .L2
movsx rdx, edx
movss DWORD PTR [rdi+rdx*4], xmm0
.L2:
add rsp, 8
.L26:
rep ret
.L28:
test eax, eax
jne .L5
xor edx, edx
jmp .L6
.L20:
mov edx, 5
jmp .L7
.L19:
mov edx, 4
jmp .L7
.L18:
mov edx, 3
jmp .L7
.L17:
mov edx, 2
jmp .L7
.L16:
mov edx, 1
jmp .L7
The output is similar, although not quite identical, on GCC 6.1.
Replacing the attribute with
#pragma GCC push_options
#pragma GCC optimize ("-fno-finite-math-only")
void replaceNaN(float * arr, int size, float newValue){
for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
#pragma GCC pop_options
makes absolutely no difference, so it is not simply a matter of the attribute being ignored. These older versions of the compiler clearly do not support controlling the floating-point optimization behavior at function-level granularity.
Note, however, that the generated code on GCC 5.1 and later is still significantly worse than compiling the function without the -ffinite-math-only switch:
replaceNaN(float*, int, float):
test esi, esi
jle .L1
lea eax, [rsi-1]
lea rax, [rdi+4+rax*4]
.L5:
movss xmm1, DWORD PTR [rdi]
ucomiss xmm1, xmm1
jnp .L6
movss DWORD PTR [rdi], xmm0
.L6:
add rdi, 4
cmp rdi, rax
jne .L5
rep ret
.L1:
rep ret
I have no idea why there is such a discrepancy. Something is badly throwing the compiler off its game; this is even worse code than you get with optimizations completely disabled. If I had to guess, I'd speculate it was the implementation of std::isnan. If this replaceNaN method is not speed-critical, then it probably doesn't matter. If you need to repeatedly parse values from a file, you might prefer to have a reasonably efficient implementation.
Personally, I would write my own non-portable implementation of std::isnan. The IEEE 754 formats are all quite well-documented, and assuming you thoroughly test and comment the code, I can't see the harm in this, unless you absolutely need the code to be portable to all different architectures. It will drive purists up the wall, but so should using non-standard options like -ffinite-math-only. For a single-precision float, something like:
bool my_isnan(float value)
{
union IEEE754_Single
{
float f;
struct
{
#if BIG_ENDIAN
uint32_t sign : 1;
uint32_t exponent : 8;
uint32_t mantissa : 23;
#else
uint32_t mantissa : 23;
uint32_t exponent : 8;
uint32_t sign : 1;
#endif
} bits;
} u = { value };
// In the IEEE 754 representation, a float is NaN when
// the mantissa is non-zero, and the exponent is all ones
// (2^8 - 1 == 255).
return (u.bits.mantissa != 0) && (u.bits.exponent == 255);
}
Now, no need for annotations; just use my_isnan instead of std::isnan. This produces the following object code when compiled with -ffinite-math-only enabled:
replaceNaN(float*, int, float):
test esi, esi
jle .L6
lea eax, [rsi-1]
lea rdx, [rdi+4+rax*4]
.L13:
mov eax, DWORD PTR [rdi] ; get original floating-point value
test eax, 8388607 ; test if mantissa != 0
je .L9
shr eax, 16 ; test if exponent has all bits set
and ax, 32640
cmp ax, 32640
jne .L9
movss DWORD PTR [rdi], xmm0 ; set newValue if original was NaN
.L9:
add rdi, 4
cmp rdx, rdi
jne .L13
rep ret
.L6:
rep ret
The NaN check is slightly more complicated than a simple ucomiss followed by a test of the parity flag, but is guaranteed to be correct as long as your compiler adheres to the IEEE 754 standard. This works on all versions of GCC, and any other compiler.

Performance optimization with modular addition when the modulus is of special form

I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from unrolling the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions,
I expected the performance gain by proper choice of the modulus to be at least 100%. However, running the output, I noted that the speed is increased by only 10%. I suspected the OS is using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second code is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical; the difference is in the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code. About the same as the running time without any modulus. About the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop-control code:
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, and the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly now reduces the running time by a factor of 3 and 2, respectively.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can easily be transformed into logic instructions.
a / n, where n is a power of two, is equivalent to a >> log2(n).
For the modulo it's the same:
a mod n can be rendered by a & (n-1)
But in your case it goes even further than that:
your value 0x100000000ULL is 2^32. This means that any unsigned 32-bit variable automatically holds a value modulo 2^32.
The compiler was smart enough to remove the operation because it is an unnecessary operation on 32-bit variables. The ULL specifier doesn't change that fact.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. It uses a transformation that seems faster than a division operation.
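A hedged sketch of both observations: for a power-of-two modulus the reduction is a single AND with n-1, and for a modulus of exactly 2^32 the natural wraparound of unsigned 32-bit addition already is the reduction:
#include <cassert>
#include <cstdint>

uint32_t add_mod_pow2(uint32_t a, uint32_t b, uint32_t n) { // n must be a power of two
    return (a + b) & (n - 1);
}

uint32_t add_mod_2_pow_32(uint32_t a, uint32_t b) {
    return a + b; // unsigned wraparound is exactly reduction modulo 2^32
}

int main() {
    assert(add_mod_pow2(0x90000000u, 0x90000000u, 0x10000000u) == 0u);
    assert(add_mod_2_pow_32(0x90000000u, 0x90000000u) == 0x20000000u);
}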