Speeding up complex-number multiplication in C++

I have some code which multiplies complex numbers, and have noticed that __mulxc3 (the long double version of __muldc3) is being called frequently: i.e. the complex-number multiplications are not being inlined.
I am compiling with g++ version 7.5, with -O3 and -ffast-math.
It is similar to this question, except that the problem persists when I compile with -ffast-math. Since I do not need to check whether the arguments are Inf or NaN, I considered writing my own very simple complex class without such checks so that the multiplication can be inlined, but given my lack of C++ proficiency, and having read this article, I suspect that would be counterproductive.
So, is there a way to change either my code or compilation process so that I can keep using std::complex, but inline the multiplication?
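One workaround sometimes suggested (a sketch under assumptions, not a verified fix for this exact setup; the helper name is hypothetical) is to do the component arithmetic yourself through std::complex's accessors, which sidesteps the Inf/NaN handling that __muldc3 implements. GCC also has -fcx-limited-range (implied by -ffast-math), which is meant to disable exactly those checks, though the question notes that -ffast-math alone did not help here:
#include <complex>

// Hypothetical helper: multiplying via the real/imag components lets the
// compiler emit plain mul/add/sub instructions instead of calling the
// __muldc3/__mulxc3 runtime routines.
template <typename T>
inline std::complex<T> mul_fast(const std::complex<T>& a,
                                const std::complex<T>& b)
{
    return { a.real() * b.real() - a.imag() * b.imag(),
             a.real() * b.imag() + a.imag() * b.real() };
}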

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

I have observed (this is a statement, not the question) that avoiding non-constant local variables in favor of const variables, or avoiding local variables altogether, enables the C++ compiler to generate faster code.
I assume that this gives the compiler more freedom to interleave the calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Is there any other explanation? For example, does the compiler give up certain optimizations once the code gets too complex, in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local and don't affect anything visible outside your function, the compiler will remove all unneeded variables as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge case, and it can be triggered by any change in the code, not only by local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that the value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little as well) is to assign a member variable to a local and then subsequently use that instead of the member variable, as in the sketch below. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs; you do have to be careful playing games like this.
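A minimal sketch of that idea (the type and names are hypothetical): caching the member in a const local tells the compiler the value is stable across the loop, even when it cannot prove that nothing inside the loop writes to this->scale:
#include <cstddef>

struct Scaler {
    double scale;
    void apply(double* data, std::size_t n) {
        const double s = scale;        // read the member once
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= s;              // no repeated load of this->scale
    }
};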
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that stores through other pointers to the same type haven't modified that array element. float *__restrict arr can help the compiler figure that out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores; see the sketch below.
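A minimal sketch of the aliasing problem (function and variable names hypothetical): without __restrict or the local copy, the store through out could alias arr[i] and force a reload for the second read:
float twice_plus(float* arr, float* out, int i)
{
    float ai = arr[i];    // one load, kept in a register
    out[i] = ai * 2.0f;   // may alias arr[i]; ai is unaffected
    return ai + 1.0f;     // written as arr[i] + 1.0f, this would reload
}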
Of course, if optimization is disabled, more statements are typically slower than fewer large expressions, and store/reload latency is usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. Clang defaults to -ffp-contract=on (within a single expression only) or off, depending on version, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
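A minimal sketch of the difference (assuming an FMA-capable target, e.g. x86-64 with -mfma): under strict contraction rules the temporary forces a separately rounded multiply, while -ffp-contract=fast lets the whole function compile to a single fma:
double mul_add(double a, double b, double c)
{
    double t = a * b;  // statement boundary: strict rules round a*b here
    return t + c;      // -ffp-contract=fast may fuse this into fma(a,b,c)
}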
(On legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD != 0, FP math happens at higher precision, and rounding to float or double costs extra.) Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that (-fno-float-store), but -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload, which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.
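A quick way to check which evaluation mode a given build uses (a small sketch; the output depends on target and flags):
#include <cfloat>
#include <cstdio>

int main() {
    // 0: evaluate in the declared type (e.g. SSE2 x86-64)
    // 2: evaluate in long double (classic x87 builds)
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    return 0;
}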

asin produces different answers on different platforms using Clang

#include <cmath>
#include <cstdio>

int main() {
    float a = std::asin(-1.f);
    printf("%.10f\n", a);
    return 0;
}
I ran the code above on multiple platforms using Clang, g++ and Visual Studio. They all gave me the same answer: -1.5707963705
If I run it on macOS using Clang, it gives me -1.5707962513.
Clang on macOS is supposed to use libc++, but does macOS have its own implementation of libc++?
If I run clang --version I get:
Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Target: x86_64-apple-darwin18.0.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
asin is implemented in libm, which is part of the standard C library, not the C++ standard library. (Technically, the C++ standard library includes the C library functions, but in practice both the GNU and LLVM C++ library implementations rely on the underlying platform math library.) The three platforms -- Linux, OS X and Windows -- each have their own implementation of the math library, so if the library function were being used, it could certainly be a different library function, and the result might differ in the last bit position (which is what your test shows).
However, it is quite possible that the library function is never being called in all cases. This will depend on the compilers and the optimization options you pass them (and maybe some other options as well). Since the asin function is part of the standard library and thus has known behaviour, it is entirely legitimate for a compiler to compute the value of std::asin(-1.0F) at compile time, as it would for any other constant expression (like 1.0 + 1.0, which almost any compiler will constant-fold to 2.0 at compile time).
Since you don't mention what optimization settings you are using, it's hard to tell exactly what's going on, but I did a few tests with http://gcc.godbolt.org to get a basic idea:
GCC constant folds the call to asin without any optimisation flags, but it does not precompute the argument promotion in printf (which converts a to a double in order to pass it to printf) unless you specify at least -O1. (Tested with GCC 8.3).
Clang (7.0) calls the standard library function unless you specify at least -O2. However, if you explicitly call asinf, it constant folds at -O1. Go figure.
MSVC (v19.16) does not constant fold. It either calls a std::asin wrapper or directly calls asinf, depending on optimisation settings. I don't really understand what the wrapper does, and I didn't spend much time investigating.
Both GCC and Clang constant fold the expression to precisely the same binary value (0xBFF921FB60000000 as a double), which is the binary value -1.10010010000111111011011_2 (trailing zeros truncated).
Note that there is also a difference between the printf implementations on the three platforms (printf is also part of the platform C library). In theory, you could see a different decimal output from the same binary value, but since the argument to printf is promoted to double before printf is called and the promotion is precisely defined and not value-altering, it is extremely unlikely that this has any impact in this particular case.
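If you want to take decimal conversion out of the picture entirely, printing the value in hex-float form shows the exact bits (a small sketch):
#include <cmath>
#include <cstdio>

int main() {
    float a = std::asin(-1.f);
    printf("%.10f  %a\n", a, (double)a);  // %a prints the exact binary value
    return 0;
}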
As a side note, if you really care about the seventh decimal place, use double instead of float. Indeed, you should only use float in very specific applications in which precision is unimportant; the normal floating-point type is double.
The mathematically exact value of asin(-1) would be -pi/2, which of course is irrational and not possible to represent exactly as a float. The binary digits of pi/2 start with
1.1001001000011111101101010100010001000010110100011000010001101..._2
Your first three libraries round this (correctly) to
1.10010010000111111011011_2 = 1.57079637050628662109375_10
On macOS it appears to get truncated to:
1.10010010000111111011010_2 = 1.57079625129699707031250_10
This is an error of less than 1 ULP (unit in the last place). It could be caused by a different implementation, by the FPU being set to a different rounding mode, or perhaps, in some cases, by the compiler computing the value at compile time.
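You can verify that the two results really are adjacent floats, one ULP apart (a sketch; the literal below is the decimal form of the correctly rounded float pi/2):
#include <cmath>
#include <cstdio>

int main() {
    float a = -1.57079637f;              // correctly rounded asin(-1.f)
    float b = std::nextafterf(a, 0.0f);  // one ULP toward zero
    printf("%.10f\n%.10f\n", a, b);      // -1.5707963705 and -1.5707962513
    return 0;
}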
I don't think the C++ standard really gives any guarantees on the accuracy of transcendental functions.
If you have code which really depends on (platform/hardware-independent) accuracy, I suggest using a library such as MPFR. Otherwise, just live with the difference, or have a look at the source of the asin implementation that is called in each case.

__builtin_sin does not work

I defined the function
double builtin_test(double y)
{
    double x = __builtin_sin(y);
    return x;
}
I thought it should work, but when compiling I get:
C:\DOCUME~1\ADMINI~1\USTAWI~1\Temp\ccNuVeYz.o:test6.c:(.text+0x5): undefined reference to `sin'
How do I make it work?
This is not REALLY an answer (but then, the question isn't exactly clear either) - it just got too long for a comment, and I don't want to chain 2-3 comments...
It is important to understand that the MAIN reason for __builtin_* functions is to support things that the compiler knows how to do - and sometimes, that turns into simply calling the relevant standard library function, rather than actually doing anything special.
If we take sin as an example: on x87, there is an fsin instruction that can be generated. But if the compiler just sees a bunch of math, it won't know that the code computes sin, so it won't produce the fsin instruction; it will just do a bunch of multiplies, subtracts, adds, etc. to perform that calculation. So using __builtin_sin inside the standard library or a header, for example like this:
double sin(double x)
{
    return __builtin_sin(x);
}
will generate an fsin instruction.
I ran some experiments with gcc and clang on my machine, and I was not able to convince the compiler to actually calculate sin at runtime and NOT call the sin function. The clang version of the code decided to compute my entire loop's worth of sin values into a constant - which is of course a good win over doing 100 calculations, but not exactly what we're looking for.
Note that the fsin instruction is quite slow, and to perform individual calculation steps using simple instructions is quite possibly equally fast - but we don't want to inline that every time sin is used (and certainly don't want the compiler to "know" how to do that).
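As for the undefined reference itself: when the compiler can neither fold nor inline __builtin_sin, it falls back to emitting an ordinary call to sin, so the math library must be available at link time. On Unix-like GCC setups that usually means adding -lm (a guess for this particular toolchain; on MinGW the math functions normally come with the default runtime):
gcc test6.c -o test6 -lm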

C++ compiler optimization for complex equations

I have some equations that involve multiple operations that I would like to run as fast as possible. Since the C++ compiler breaks it down into machine code anyway, does it matter if I break it up into multiple lines like
A=4*B+4*C;
D=3*E/F;
G=A*D;
vs
G=12*E*(B+C)/F;
My need is more complex than this, but I think it conveys the idea. Also, if this is in a function that gets called in a loop, does defining double A, D cost CPU time vs. putting them in as class variables?
Using a modern compiler (Clang/GCC/VC++/Intel), it won't really matter. The best thing you should do is worry about how readable your code will be, and turn on optimizations; compiler designers are well aware of issues like these and design their compilers to (for the most part) optimize accordingly.
If I had to say which would be slower, I would assume the first way, since there would be three extra mov instructions - but I could be wrong. Either way, this isn't something you should worry about too much.
If these variables are integers, that second code fragment is not a valid optimization of the first. For B=1, C=1, E=1, F=6, you have:
A=4*B+4*C;      // 8
D=3*E/F;        // 0 (integer division: 3/6 truncates to 0)
G=A*D;          // 0
and
G=12*E*(B+C)/F; // 4 (24/6)
If floating point, then it really depends on what compiler, what compiler options, and what CPU you have.

How to store doubles in memory

Recently I changed some code
double d0, d1;
// ... assign things to d0/d1 ...
double result = f(d0, d1);
to
double d[2];
// ... assign things to d[0]/d[1]
double result = f(d[0], d[1]);
I did not change any of the assignments to d, nor the calculations in f, nor anything else apart from the fact that the doubles are now stored in a fixed-length array.
However when compiling in release mode, with optimizations on, result changed.
My question is, why, and what should I know about how I should store doubles? Is one way more efficient, or better, than the other? Are there memory alignment issues? I'm looking for any information that would help me understand what's going on.
EDIT: I will try to get some code demonstrating the problem, however this is quite hard as the process that these numbers go through is huge (a lot of maths, numerical solvers, etc.).
However, there is no change when compiled in Debug. I will double-check this again to make sure, but it is almost certain: the double values are identical in Debug between version 1 and version 2.
Comparing Debug to Release, results have never ever been the same between the two compilation modes, for various optimization reasons.
You probably have a 'fast math' compiler switch turned on, or are doing something in the "assign things" (which we can't see) which allows the compiler to legally reorder calculations. Even though the sequences are equivalent, it's likely the optimizer is treating them differently, so you end up with slightly different code generation. If it's reordered, you end up with slight differences in the least significant bits. Such is life with floating point.
You can prevent this by not using 'fast math' (if that's turned on), or by forcing ordering through the way you construct the formulas and intermediate values. Even that is hard (impossible?) to guarantee. The question is really "Why is the compiler generating different code for arrays vs. numbered variables?", but that's basically an analysis of the code generator.
No, these are equivalent - you have something else wrong.
Check the /fp:precise flag (or equivalent). The processor's floating-point hardware can run in a higher-accuracy or a higher-speed mode, and it may have a different default in an optimized build.
With regard to floating-point semantics, these are equivalent. However, it is conceivable that the compiler might decide to generate slightly different code sequences for the two, and that could result in differences in the result.
Can you post a complete code example that illustrates the difference? Without that to go on, anything anyone posts as an answer is just speculation.
To your concerns: memory alignment cannot affect the value of a double, and a compiler should be able to generate equivalent code for either example, so you don't need to worry that you're doing something wrong (at least, not in the limited example you posted).
The first way is more efficient, in a very theoretical way. It gives the compiler slightly more leeway in assigning stack slots and registers. In the second example, the compiler has to pick 2 consecutive slots - except of course if the compiler is smart enough to realize that you'd never notice.
It's quite possible that the double[2] causes the array to be allocated as two adjacent stack slots where it wasn't before, and that in turn can cause code reordering to improve memory-access efficiency. IEEE 754 floating-point math doesn't obey the usual algebraic rules; in particular, addition is not associative: (a+b)+c can differ from a+(b+c).
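A minimal sketch of that non-associativity, which is exactly the kind of difference an optimizer's reordering can expose:
#include <cstdio>

int main() {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%g\n", (a + b) + c);  // 1: the large terms cancel first
    printf("%g\n", a + (b + c));  // 0: c is absorbed into -1e16
    return 0;
}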